What This Book Can and Cannot Tell You


Because PowerPC processors are being used in computers with different system architectures 
(with some of these running multiple operating systems), the scope of this book is limited to 
the processors themselves. This book will show you how these processors work, so that you can 
get the maximum performance out of your PowerPC processor based computer. And even 
though this book will primarily cover assembly language programming, you will be able to use 
this knowledge to optimize your programs, even when you are working in a higher-level language,
such as C.


The purpose of this book is not to describe interfaces at the system level, nor is it a prescription 
for how these interfaces should be implemented in hardware or software. Rather, it is intended 
as a guide (possibly even for the developer of these interfaces or systems) to explain how PowerPC 
microprocessors work, and to provide enough information so that developers and users of these 
interfaces can most effectively get these jobs done. 


Part 1 of this book is an introduction to the PowerPC architecture. Chapter 1, “PowerPC
Concepts,” explains what the PowerPC concept is, where it came from, and the ideas behind
the PowerPC processor architecture. Chapter 2, “Introduction to the PowerPC Architecture,”
provides a more detailed view of the architecture and a high-level view of the PowerPC
machine environment: the types and number of machine registers, the kinds of instructions
available, and the programming constructs the machine was designed to support most efficiently.


Chapters 3, 4, and 5 form Part 2 and give a description of the instruction set. These chapters 
provide a detailed reference—they are the place to look to find a precise description of the 
behavior of each PowerPC instruction. 


Finally, Part 3 discusses assembly programming on the PowerPC architecture. Chapter 7, “Cod- 
ing Strategy and Tuning for Performance,” covers a series of issues one step removed from the 
machine architecture. These include the conventions that are adopted to help structure and 
organize code sequences into subroutines, programs, and libraries (such as call-return linkage 
conventions, register save and restore requirements, and interface between assembly language 
and compiler-generated code). 


Chapter 7 is more about code strategy than it is about the actual nuts and bolts. It describes in 
detail how the designers of the PowerPC architecture intended it to be used. This chapter will 
help you understand the difference between PowerPC code that works, and PowerPC code that
works very well. Topics discussed include performance optimization (both general and
processor-specific), coding practices to avoid, platform independence, and seamless transition
to 64-bit addressing.


The appendixes provide programming examples, as well as a wealth of more detailed information
that may not be of general interest but nonetheless serves as a comprehensive reference.
Some of the topics covered include operating system memory management and address
translation, multiprocessor systems and shared memory, and details of floating-point processing.


About the Sample Programs 


Because of the range of PowerPC processor-based computer systems, operating systems, and 
associated development tools, the sample programs are as portable as possible. There are no 
specific examples of system device programming (such as timer chip programming), as these 
will vary with your computer system, and your operating system may prevent your accessing 
them. The examples also avoid heavy use of assembler pseudo-ops (commands to the assem- 
bler) as they may vary between specific assemblers. While we discuss calling assembly language 
routines from C and use of the TOC for XCOFF format object files, this is in the context of 
AIX, and the specifics may not apply to your operating system. 


Conventions Used in This Book 


The following typographic conventions are used in this book: 
• Code lines, commands, statements, variables, and any text you type or see on the
  screen appear in a computer typeface.

• Placeholders in syntax descriptions appear in an italic computer typeface. Replace
  the placeholder with the actual filename, parameter, or whatever element it represents.

• Italics highlight technical terms when they first appear in the text and are sometimes
  used to emphasize important points.

• Pseudocode, a way of explaining in English what a program does, also appears in
  italics.








This book is a field guide for developers of PowerPC applications and systems software. In
researching this book, we found that much of the currently available reference information on the
PowerPC architecture is written from the processor designer’s, rather than the programmer’s,
point of view. The designer wants to know how each feature works; the programmer wants to
know how to use and combine those features most effectively to solve a problem. This is a
programmer's book about PowerPC processors.


With modern compilers and ever faster processors, it might seem that assembly language
programs would be going the way of relics like punched-paper tape, job control cards, and, well,
floppy disks. The advantages of high-level languages (ease of development, maintainability, and
portability) are compelling. And, indeed, most software development these days is done in a
high-level language.


Nonetheless, a significant body of the code you execute on your machine (PowerPC or not)
each and every time you use it was written in assembly language. This includes parts of the
operating system, ROM firmware, and run-time library code that is linked into virtually
every application program. Why is so much code “still” written in assembler? In some cases,
it’s because high-level languages don’t provide a complete interface to the hardware. They don’t
provide a direct way to inspect and modify some of the special-purpose machine registers, or to
execute certain instructions. In these cases, there is no choice but to write some things in
assembly language.


Another reason to write assembler is performance. An old hacker’s truism states that a “real
programmer” can always remove at least one instruction from any program. If this were really the
case, all programs could ultimately be reduced to no instructions at all! But seriously, in code that
is executed very frequently, removing even that “one instruction” can make a measurable
performance difference. If you’ve ever examined compiler-generated code, even from so-called
industrial-strength optimizing compilers, it’s not hard to see ways of improving it, even if it is just an
instruction or two here or there.


A huge share of the code written in any application is executed so infrequently that an extra in- 
struction here or there won’t make much difference, so there’s no reason not to use a high-level 
language. However, there may be some parts of your application that are executed very frequently 
indeed. If your program isn’t fast enough (is it ever?), there may be room to significantly speed it 
up with some judiciously applied assembly programming muscle. 


Even if you are completely satisfied with the performance of your PowerPC application, there
are other reasons you might need to know about PowerPC assembly language programming.
Sometimes, debugging compiled language code requires you to examine the generated assembly
output, either to make sure it’s correct, or when stepping through it with debugging tools.


This book was written to help make the job of developing software for PowerPC systems easier 
and more effective. We’ve gathered a wealth of information from numerous sources to provide 
you a single reference for the PowerPC architecture, instruction set, and programming conven- 
tions used by PowerPC compilers and systems software. We describe the PowerPC architecture 
and instruction set, and then go on to show you some of the tricks of programming PowerPC 
machines effectively, what works—and what to avoid—if you want maximum performance from 
your PowerPC machine.


The History of the PowerPC Microprocessor


In 1992, the PowerPC alliance of IBM, Motorola, and Apple was formed to design and build 
microprocessors that would meet the needs of all three partners. The instruction set chosen for 
these processors is a refinement of the set used by IBM’s Performance Optimization With 
Enhanced RISC (POWER) machines. The POWER architecture was introduced in 1989 in 
IBM’s RS/6000 line of workstation computers. The instruction set for this machine has its 
origins with the 801 minicomputer that was developed at IBM’s T.J. Watson Research Lab in 
the late 1970s. 


The evolution of the POWER instruction set to the PowerPC set resulted in only minor changes
to the instruction set and the programmer's view of the machine. Many of the changes affect
only system interface instructions that never appear in user programs, but some changes were
also made to user-mode instructions. The most significant change is the addition of
instructions to support single-precision (32-bit) floating-point values. Also, some changes were
made to allow for upward-compatible extension to 64-bit addresses and data values. The
PowerPC 620 is the first available 64-bit PowerPC implementation; it is capable of executing
32-bit PowerPC applications without modification.


While it may seem strange to start with a 32-bit architecture when a 64-bit architecture is seem- 
ingly just around the corner, there were important reasons for doing this. First, the PowerPC 
architecture is close enough to the POWER architecture that existing POWER applications 
can run unchanged on the 32-bit PowerPC processors. This meant that there was already an 
existing code base for the new architecture which solves one of the hardest problems associated 
with introducing a new architecture—nobody buys a system without software, but nobody 
creates software for a system which no one will buy. Another advantage is that companies which 
were new to the POWER/PowerPC architectures could start developing software (operating 
systems, important applications, etc.) before the first PowerPC processors and systems were 
available by using POWER-based systems. Finally, it is questionable whether the 64-bit oper- 
ating systems and processors are really just around the corner for all market segments. It will 
probably be quite some time before 64-bit processors are needed for the consumer desktop 
and mobile computer markets. 


The PowerPC Instruction Set Architecture 


An instruction set architecture is the set of commands by which the (assembly language)
programmer can inspect and alter the state of the machine. The PowerPC microprocessor usually
is described as a Reduced Instruction Set Computer (RISC) architecture. Some of the RISC
characteristics of the PowerPC microprocessor include a large set of general registers and
instructions that perform a single operation rather than a sequence. Instructions that compute
(add, shift, and so on), for example, operate only on registers; separate load and store
instructions are provided to transfer values between registers and memory. This approach
is different from a CISC architecture, in which a single instruction can perform both a
calculation and a load or store.


Understanding the benefits of the RISC approach requires some idea as to how these opera- 
tions must be implemented in hardware. A simple explanation is that the load from memory 
usually takes a long time, at least compared to the time it takes to perform the add operation. 
In the CISC machine, the load operation is started, and the machine must wait for the value to 
be retrieved from memory before performing the addition. In the RISC machine, the same 
operation requires two instructions: one for the load, and one for the add. Superficially, this 
approach seems worse (why use two instructions when one instruction would suffice?). By sepa- 
rating the two operations into two distinct machine instructions, however, the astute program- 
mer (or compiler) is free to move the load instruction earlier in the program, placing other 
instructions that do not require the memory value between the load and the add instructions. 
When the value is needed for the computation, it already is available in a register, and the pro- 
cessor does not need to wait idly for the value to be returned from memory. 
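
The effect of moving a load earlier can be illustrated with a toy timing model. The sketch below is only an illustration under stated assumptions: the latencies (3 cycles for a load, 1 cycle for an add), the one-instruction-per-cycle in-order issue rule, and the names `total_cycles`, `naive`, and `scheduled` are all hypothetical and do not describe any particular PowerPC implementation.

```python
# Toy in-order timing model (an illustration, not a model of a real PowerPC chip).
LOAD_LATENCY = 3  # assumed cycles before a loaded value can be used
ALU_LATENCY = 1   # assumed cycles for register-to-register operations


def total_cycles(program):
    """Return the cycle count for a list of (op, dest, sources) tuples.

    One instruction issues per cycle, in program order. An instruction
    stalls until every source register it reads is ready.
    """
    ready_at = {}  # register name -> first cycle its value is available
    cycle = 0      # earliest cycle in which the next instruction may issue
    for op, dest, sources in program:
        issue = max([cycle] + [ready_at.get(reg, 0) for reg in sources])
        latency = LOAD_LATENCY if op == "load" else ALU_LATENCY
        ready_at[dest] = issue + latency
        cycle = issue + 1
    return max([cycle] + list(ready_at.values()))


# Load placed immediately before the add that needs its value.
naive = [
    ("add", "r5", ["r6", "r7"]),
    ("add", "r8", ["r9", "r10"]),
    ("load", "r3", ["r4"]),
    ("add", "r11", ["r3", "r5"]),
]

# Same work, with the load moved ahead of the independent adds.
scheduled = [
    ("load", "r3", ["r4"]),
    ("add", "r5", ["r6", "r7"]),
    ("add", "r8", ["r9", "r10"]),
    ("add", "r11", ["r3", "r5"]),
]

print(total_cycles(naive), total_cycles(scheduled))
```

Under these assumed latencies the rearranged sequence finishes sooner because the independent adds fill the cycles that would otherwise be spent waiting for the loaded value.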


PowerPC processors take this process one step further, in that the processor itself can perform
this kind of instruction scheduling on the fly; this is called out-of-order execution, or instruction
scheduling. In many cases, the programmer does not need to worry about rearranging the
order of instructions, because the processor can look ahead in the instruction stream and do the
reordering all by itself. Even so, the hardware can look ahead only so far, and there are limits to
the reordering capabilities for each PowerPC implementation. Although PowerPC processors
can handle some of the instruction-reordering chores, it still is possible to improve performance
by judicious instruction-scheduling and coding practices.


PowerPC Processors 


A wide spectrum of PowerPC implementations has been announced or speculated about in 
the press, from embedded controllers to high-performance server engines. At the time of this 
writing, four general-purpose PowerPC processor chips are available or officially announced: 
the PowerPC 601, 603, 604, and 620 processors. IBM and Motorola also are introducing 
PowerPC processors designed for embedded control applications. 


The PowerPC 601 is targeted at the desktop workstation and PC market. The first available
PowerPC systems, the IBM RS/6000 Model 250 and Apple’s first Power Macintoshes, are based
on the 601 processor. The 601 is manufactured by IBM and is marketed by IBM and Motorola.
The other three chips in the family are produced by both companies.


The PowerPC 603 processor is a smaller, lower-cost implementation than the 601, designed
primarily for use in portable and laptop machines. The 603 features active power
management: when execution units are not needed, they are shut down. The 603 also
supports software-configurable, low-power, standby modes. The PowerPC 604 is the
next-generation, high-performance chip intended for desktop and server applications. The 601,
603, and 604 support 64-bit wide bus data transfers, but are 32-bit PowerPC implementations
by virtue of the 32-bit width of their machine registers and memory addresses.


The PowerPC 620 is a high-performance desktop and server engine, and it is the first full 
64-bit PowerPC implementation. Registers in the 620 are 64 bits wide, and additional 
instructions for operating on 64-bit quantities are provided. These extensions are upward-com- 
patible, and the 620 is capable of running all 32-bit PowerPC applications. The principal 
feature of the 64-bit PowerPC architecture and the 620 is overcoming the 4-gigabyte limit on 
the effective address space imposed by 32-bit registers. The 64-bit 620 architecture increases 
the number of addressable memory locations and introduces a new virtual address translation 
architecture for virtual memory in 64-bit mode. The 620 also supports the 32-bit addressing 
modes, making it possible to run 32-bit operating systems and applications without modifica- 
tion on a 620 machine. 


From the applications programmer’s point of view, the various PowerPC processors are virtu- 
ally indistinguishable. Code sequences of user-mode PowerPC instructions produce identical 
results on the 601, 603, and 604, and these sequences define the same computation on the 620 
(although sequences that exploit the 64-bit wide general-purpose registers (GPRs) in the 620 
are not backward-compatible). 


Some specific features of the PowerPC 601 also bear mention. In addition to the 32-bit PowerPC 
instruction set, the 601 provides support for the IBM POWER instructions that were not in- 
cluded in the PowerPC. The intent was to help ease IBM’s transition from POWER to PowerPC. 
These POWER-only instructions are available in all 601-based systems, but it would be pru- 
dent to avoid them in code that might need to run on the 603, 604, 620, or future PowerPC 
processors. In almost all cases, the function provided by a “missing” POWER instruction is 
available in a more general form in a new PowerPC instruction. The only other difference 
between the 601 and other PowerPC implementations is in the number of the address transla- 
tion registers. This difference can be ignored safely everywhere but in the memory- 
management sections of the operating system. 


IBM and Motorola also are introducing PowerPC processors designed for embedded control 
applications. Most of the material in this book should be germane to programming these de- 
vices as well. 


PowerPC-Based Systems 


The PowerPC architecture is a detailed specification of the machine language. It doesn’t specify 
what happens outside the processor, however, and how a programmer can communicate with 
other elements of the system. Because PowerPC processors are being used in a wide variety of 
applications, from automobile engine controllers to video games to personal workstations to 
high-performance supercomputers, it makes sense to specify the system details separate from 
the processor architecture.

Even so, it is difficult to develop an interesting application that does not need to interact with
devices beyond the processor and memory; video displays, disk drives, keyboards, and other
devices occasionally come in handy! The designer of a system must link the external devices to
the processor, and the interface to the “outside world” is usually through assignment of special
meanings to the reading and writing of specific memory locations and external interrupts. The
specifics of such conventions define a system architecture.


The IBM Personal Computer system architecture was defined long ago around the Intel 8088 
processor. So that they could run IBM PC software, other PC makers adopted precisely the 
same system architecture. Despite limitations of the system design, compatibility with the ex- 
isting software base requires that the conventions defined in the original IBM PC be replicated 
faithfully in every PC-compatible machine to this day. 


Some examples of current PowerPC system architectures include the Apple Power Macintosh, 
the IBM RS/6000 PowerPC-based workstations, and the Motorola PowerStack machines. These 
systems are slightly different due to their lineage and to the specific hardware used in these 
systems. Future portable and multiprocessor systems based on the PowerPC will have their own 
unique hardware needs. 


In modern computer systems, the operating system manages the system devices and provides 
software interfaces to higher level applications. This approach, however, forces the operating 
system to have a detailed knowledge of the system architecture of the platform on which it is 
running, which slows or even prevents porting of operating systems between hardware plat- 
forms. 


The PowerPC Reference Platform (PReP) is a specification for a system architecture that will 
address these issues by providing a standard layer of abstraction, typically in software, between 
the operating system and the actual system hardware. This layer not only will allow systems 
composed of different hardware, but will allow the latest hardware technology to be incorpo- 
rated into computer systems with little impact on system software. PReP is an open standard 
being developed and supported by multiple computer manufacturers interested in PowerPC 
hardware and software development. At the time of this writing, the PReP specification has 
not been finalized. 


PowerPC Operating Systems 


Currently, at least six major operating systems are available or under development for PowerPC
systems, including Apple Macintosh System 7, IBM AIX, IBM Workplace OS, Taligent,
Microsoft Windows NT, and SunSoft Solaris. While all of these operating systems run on at
least one other non-PowerPC platform, some of them currently will run only
on a specific PowerPC system, but in the future they may be able to run on any
PReP-compliant system. Although this large number of operating systems enables many
existing applications to be ported from other hardware platforms, it raises questions when
developing new software, such as which specific operating system to develop for.

In addition, the PowerOpen organization was formed by a set of partners (including IBM, Apple,
Bull, Harris, Tadpole Technology, and THOMSON-CSF) interested in developing PowerPC
hardware and software. The purpose of the partnership is to develop standards, common
interfaces, and the technology necessary to help minimize the cost of developing and
distributing PowerPC software. The PowerOpen environment specifies a UNIX type of operating
system with the capability to run Macintosh applications, an X Window and Motif-based
graphical user interface, and binary compatibility.


In this chapter we will introduce some of the key concepts behind the PowerPC architecture.
First we will discuss some more general topics as an introduction to modern computer
architecture.


Amdahl’s Law 


One of the best strategies for increasing performance in modern computer architecture is to 
optimize the performance of the most common tasks. It is rarely worth investing effort to 
improve the performance of rare events—a change producing a two-fold increase in perfor- 
mance for a task that accounts for only 2 percent of a program’s overall execution time only 
improves overall performance by 1 percent. Doubling the speed of an event that accounts for 
98 percent of the execution time, however, improves overall performance by 96 percent— 
almost the full factor of two. This effect is known as Amdahl’s Law, which is one of the 
foundations of modern computer design. According to Amdahl’s Law the general equation for 
the overall speedup to a system, if a change is made to a single task within the system, is: 


\[
\text{Speedup} = \frac{1}{(1 - \text{Fraction}_{\text{Enhanced}}) + \dfrac{\text{Fraction}_{\text{Enhanced}}}{\text{Speedup}_{\text{Enhanced}}}}
\]


Thus for the example above where we increased performance two-fold for the task that accounted 
for 2 percent of the total execution time of the program, we get 


\[
\frac{1}{(1 - 0.02) + \left(\dfrac{0.02}{2}\right)} = 1.01
\]


For the example above where we increased performance two-fold for the task that accounted 
for 98 percent of the total execution, we get 


\[
\frac{1}{(1 - 0.98) + \left(\dfrac{0.98}{2}\right)} = 1.96
\]


This law applies equally well to software and hardware, and is why programmers are willing to
spend time optimizing loops and other pieces of code that account for the bulk of a
program's execution time.
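
The two worked cases above can be checked with a few lines of code. This is a minimal sketch of the Amdahl's Law equation as stated in the text; the function name `amdahl_speedup` and its parameter names are our own labels, chosen to mirror the terms in the equation.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of execution time is sped up.

    fraction_enhanced: share of the original execution time affected (0..1).
    speedup_enhanced:  factor by which that share is accelerated.
    """
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)


# The two cases from the text: a two-fold improvement applied to
# 2 percent and to 98 percent of the execution time.
print(round(amdahl_speedup(0.02, 2), 2))
print(round(amdahl_speedup(0.98, 2), 2))
```

Trying other fractions shows how quickly the payoff from an optimization falls off as the affected share of execution time shrinks.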


Much of this book is dedicated to the programmer attempting to optimize critical pieces of
code. With modern compiler technology, there is little or no reason to attempt to code entire
programs in assembly language. Programmers can write code once and port that code to many
different platforms by using a high-level programming language and letting the compiler
optimize the bulk of the code. For some programs, a significant performance gain might be
available simply by rewriting parts of the code in machine-specific assembly language. This small
portion of code would need to be rewritten for different microprocessor targets. By isolating
the application of assembly language to small performance-critical sections of code, most of
the performance benefits of assembly language coding can be realized without sacrificing
general portability.


The Ever Popular RISC Versus CISC Discussion 


Different instruction set architectures have different strengths and weaknesses when it comes
to code optimization, particularly in the areas of compiler and hand
optimization strategies. One important classification of instruction set architectures is Reduced
Instruction Set Computer (RISC) architectures versus Complex Instruction Set Computer
(CISC) architectures. There are many differences between RISC and CISC architectures other
than the complexity of the instructions which comprise the architecture. While many of these
differences are not exactly part of the definitions of RISC or CISC, they have come to be
associated with these two architectural styles (as defining differences).


The first major difference is that RISC architectures are typically load/store architectures, while 
CISC architectures are not. A load/store instruction set architecture restricts memory access 
operations to loading and storing values; there are no instructions that perform operations di- 
rectly on memory locations. A non-load/store architecture not only permits the loading and 
storing of data, but also combines other data manipulation operations with these memory- 
access operations in a single machine instruction. The PowerPC architecture is a load/store 
architecture—data must be loaded from memory into a register before it can be manipulated. 
Once the data has been manipulated, it must be stored from a register into a memory location. 


The Intel X86 and Motorola 68000 instruction sets are not load/store instruction set architec- 
tures, because both have instructions that directly manipulate memory. For instance, the X86 
instruction set has an add instruction—ADD AX,[BX]—that adds the contents of memory 
addressed by the value in BX to the contents of the AX register. This instruction would not be 
appropriate for a load/store architecture because it combines a memory access with an opera- 
tion on the value in memory. 
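
The distinction can be made concrete with a toy machine model. The sketch below is an illustration only: the dictionaries standing in for registers and memory, the 16-bit mask, and the function names `add_reg_mem`, `load`, `add`, and `run_both` are hypothetical, and the register names `r3` through `r5` are arbitrary choices rather than a real calling convention.

```python
MASK16 = 0xFFFF  # the toy machine uses 16-bit registers, like AX and BX


def add_reg_mem(regs, mem, dest, addr_reg):
    """Register-memory add: one instruction both reads memory and adds."""
    regs[dest] = (regs[dest] + mem[regs[addr_reg]]) & MASK16


def load(regs, mem, dest, addr_reg):
    """Load/store style: a memory access only, with no arithmetic."""
    regs[dest] = mem[regs[addr_reg]]


def add(regs, dest, src_a, src_b):
    """Load/store style: arithmetic on registers only."""
    regs[dest] = (regs[src_a] + regs[src_b]) & MASK16


def run_both():
    """Compute the same sum both ways and return the two results."""
    mem = {0x100: 7}

    # Non-load/store form, in the spirit of ADD AX,[BX].
    cisc = {"AX": 5, "BX": 0x100}
    add_reg_mem(cisc, mem, "AX", "BX")

    # Load/store form: an explicit load followed by a register add.
    risc = {"r3": 5, "r4": 0x100, "r5": 0}
    load(risc, mem, "r5", "r4")
    add(risc, "r3", "r3", "r5")

    return cisc["AX"], risc["r3"]


print(run_both())
```

Both forms produce the same value; the load/store form simply exposes the memory access as a separate instruction that can be placed independently of the add.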


In order to study the advantages and disadvantages of load/store architectures, one must have 
an understanding of the system in which the processor is being used. For this discussion, we 
will assume a typical uniprocessor system (see Figure 2.1). 


A modern microcomputer system includes several components. At the core is the microprocessor.
The microprocessor controls the rest of the system components through the processor bus.
Almost all general-purpose microprocessors in use today are synchronous processors, meaning
that the flow of data through the processor is controlled by a global clock that divides time into
equal-length units called clock cycles. Typical microprocessor clock cycles range from about 20
ns to 3 ns. The microprocessor is connected to a processor bus that also is often synchronous.
Typical clock cycles for processor buses range from about 40 ns to 10 ns; these frequencies are
limited somewhat by the physics associated with long wire runs (from 1 to 5 inches) on printed
circuit boards (PCBs). The frequencies at which microprocessors run have been increasing
steadily over the past 20 years, while the access times for Dynamic Random-Access Memory
(DRAM) used as main memory have been improving at a much slower rate. The result is that
modern microprocessors in microcomputers generally run at a higher frequency than the
processor bus and certainly at a much higher frequency than main memory, which generally is
made of DRAMs with typical access periods of 60 ns to 80 ns.
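
To put these figures in perspective, the cycle times can be converted to frequencies and the DRAM access periods expressed in processor clock cycles. The following is a rough sketch using only the numbers quoted above; the function names are our own, and the simple division ignores bus and controller overhead, so it understates the real cost of a memory access.

```python
def period_ns_to_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


def cpu_cycles_per_access(access_ns, cpu_period_ns):
    """Processor clock cycles that elapse during one memory access period."""
    return access_ns / cpu_period_ns


# Processor clock periods quoted in the text: 20 ns down to 3 ns.
print(period_ns_to_mhz(20), round(period_ns_to_mhz(3), 1))

# DRAM access periods quoted in the text, measured in processor cycles.
print(cpu_cycles_per_access(80, 20), cpu_cycles_per_access(60, 3))
```

The faster the processor clock, the more cycles a single DRAM access spans, which is why the gap between processor and memory speed matters more as clock rates rise.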










FIGURE 2.1. A typical computer organization. (Diagram not reproduced; its labels include an off-chip cache.)

Because of the difference between the access time of on-chip resources such as registers and
off-chip resources such as main memory, data within the processor can be manipulated much more
quickly than data stored in off-chip resources. This is the reason for the recent trend toward
processors that run at higher clock speeds internally than their external interfaces.
For instance, the 66 MHz Intel 80486DX2 processor has an internal clock rate of 66 MHz,
but its external interface runs at 33 MHz.


One of the advantages of a load/store architecture is that the long-latency memory access can
be separated from the short-latency data manipulation. This capability enables the compiler or
assembly language programmer to schedule short-latency, on-chip, data-manipulation
instructions between an access to data contained in an off-chip resource and the use of that data. This
capability is known as code scheduling and is a very important concept when dealing with
RISC/load-store architectures.


The PowerPC Architecture


In this section, we introduce some of the specifics of the PowerPC architecture. This forms the 
groundwork needed to understand the instruction descriptions in Part II. 


Registers 


Registers are local memory in the processor where intermediate results are stored. The result of 
one instruction may be placed in a register, which in turn may be read by another instruction. 


There are several different types of registers defined by the PowerPC architecture. They can be 
classified into four major groups: general-purpose integer registers, floating-point registers, status 
and control registers, and special-purpose registers (see Table 2.1). 


Table 2.1. The basic PowerPC register set (32-bit architecture).

Name                      Size     Number  Description

General-purpose integer   32 bits  32      Registers available for integer arithmetic
registers*                                 and address calculation.

Floating-point registers  64 bits  32      Registers available for floating-point
                                           arithmetic instructions.

Condition register        32 bits  1       Register used to direct branches. It forms
                                           the communications path between the data
                                           flow and the control flow of a program.

Fixed-point exception     32 bits  1       This register is used by the integer
register                                   instructions to store information about
                                           exceptional conditions which arise during
                                           the execution of instructions.

Floating-point status     32 bits  1       This register is used by the floating-point
and control register                       instructions to store information about
                                           exceptional conditions that arise during
                                           the execution of instructions. It is used
                                           to control the behavior of the
                                           floating-point execution unit.

Link register*            32 bits  1       Used for the subroutine linkage address.

Count register*           32 bits  1       Used for coding loops with a set number of
                                           iterations.

Segment register          32 bits  16      Used for memory management.

*In 64-bit implementations these registers are 64 bits wide.





General-purpose registers are used for most operations and are very important to code
scheduling. While not a defining characteristic of RISC architectures, the number of general
registers is a difference between most current RISC architectures and most current CISC
architectures. There are 32 general-purpose integer registers (often referred to simply as GPRs)
defined by the PowerPC architecture. In 32-bit implementations, the GPRs are 32 bits wide,
while in 64-bit implementations, the GPRs are 64 bits wide. Integer registers are used for
integer arithmetic calculations and memory addressing.


There are also 32 floating-point registers each of which can hold a 64-bit IEEE double preci- 
sion floating-point number or single precision floating-point number (IEEE floating-point num- 
bers will be discussed later in this chapter). Floating-point registers are used for floating-point 
arithmetic instructions. 


The next class of registers is the status and control registers. This class includes the condition
register, the fixed-point exception register, and the floating-point status and control register.


There are eight 4-bit condition code register fields within the condition register. These fields 
behave like a cross between general-purpose and special-purpose registers. The fields are sym- 
metric in most cases, that is, there are instructions that can access any condition register field 
in place of any other condition register field; however, there are certain non-symmetric cases. 
Integer arithmetic instructions can be coded to update condition register field zero automati- 
cally and floating-point arithmetic instructions can be coded to update condition register field 
one automatically. 


The condition register forms the primary information path between the data flow of a pro- 
gram and the control flow of that program. In other words, if the results of an instruction ex- 
ecution are needed to determine the direction of a branch, then the condition register is used 
to pass that information from the executing instruction to the dependent branch. This is an- 
other important issue relating to code scheduling. The determining event for the branch can 
be scheduled ahead of the branch so that the branch direction is known as early as possible. 


The fixed-point exception register (XER), also referred to as the integer exception register, is
used to record exception conditions that arise during the execution of integer instructions. The
carry bit is also stored in the XER. The carry bit is set by certain arithmetic operations, typically
when a result needs an extra bit of precision. For instance, the add carrying instructions set
the carry bit when an addition causes a carry out of the most significant bit. Other instructions
read the carry bit as an implicit operand. For instance, the extended add instructions add the
carry bit to the other operands.
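
The way the carry bit links an add carrying operation to an extended add can be illustrated with a small model. The following Python sketch is not PowerPC code; the function names are hypothetical, and the carry bit is modeled as an ordinary return value rather than a field of the XER. It adds two 64-bit numbers held as pairs of 32-bit words.

```python
MASK32 = 0xFFFFFFFF

def add_carrying(a, b):
    """Model of an add carrying operation: returns (32-bit sum, carry out)."""
    total = (a & MASK32) + (b & MASK32)
    return total & MASK32, total >> 32

def add_extended(a, b, carry_in):
    """Model of an extended add: the carry bit is an extra, implicit operand."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32

def add64(a_hi, a_lo, b_hi, b_lo):
    """Add two 64-bit values stored as (high word, low word) pairs."""
    lo, carry = add_carrying(a_lo, b_lo)      # low words produce the carry
    hi, _ = add_extended(a_hi, b_hi, carry)   # high words consume it
    return hi, lo
```

For example, adding 1 to the value whose high word is 0x00000001 and whose low word is 0xFFFFFFFF carries out of the low word, so the model returns a high word of 0x00000002 and a low word of 0x00000000.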


The floating-point status and control register (FPSCR) is like the XER but for the floating- 
point unit. It is discussed more thoroughly in Appendix D (“A Detailed Floating-Point Model”). 


There are several special-purpose registers defined in the PowerPC architecture. Unlike the 
general-purpose registers, these registers each have a special meaning or purpose. They are gen- 
erally only accessed by a specialized set of instructions or when certain events occur within the 
processor. We will leave the discussion of most of these registers until Appendix C (“Operat- 
ing System Design for PowerPC Processors”), but we will mention two important registers here: 
the link register and the count register. These registers are associated with the branch instruc- 
tion portion of the architecture. The primary use of the link register is for subroutine linkage. 
Branch instructions may be coded to automatically load the subroutine linkage address into 
the link register. The subroutine linkage address is the return address associated with a subrou- 
tine call. The count register is used to control fixed length loop code. If the termination of a 
loop is determined by the completion of a number of iterations rather than by some condition 
being met, then the number of iterations can be loaded into the count register. Branch instruc- 
tions can be coded to automatically decrement the count register and test for the count reach- 
ing zero to determine the direction of the branch. 


Introduction to Memory 


In this section, we give a very brief description of virtual memory on PowerPC processors. This
is primarily background information, as the operating system will typically hide memory
management from the application programmer. As this section is here for illustrative purposes, only
32-bit memory management is discussed. For a more complete discussion of both 32-bit and
64-bit memory management, see Appendix C (“Operating System Design for PowerPC
Processors”). For a really complete discussion of memory management in the PowerPC
architecture, we recommend obtaining the PowerPC Architecture Specification (see Appendix G,
“Further Reading”).


The PowerPC architecture supports virtual memory, a method of making a large memory space
using a large storage device such as a hard drive for main storage and using the computer’s
main memory, typically DRAM, as a cache for the main storage (see Figure 2.2). The address
used to access virtual memory is called the virtual address, and in 32-bit PowerPC processors,
the virtual address is 52 bits long. Virtual memory is split into $2^{24}$ 256-MB segments. Each
segment is split into $2^{16}$ 4-KB pages that can be swapped in and out of memory by the
operating system. Finally, a 12-bit offset into the page enables each byte of memory to be addressed.


FIGURE 2.2. Virtual and physical memory.


A programmer uses a 32-bit effective address to access memory. The high-order 4 bits select
one of 16 segment registers on the processor. Each segment register contains a 24-bit virtual
segment ID corresponding to one 256-MB segment. The next 16 bits in the effective address
identify the page within the segment that is being accessed. The last 12 bits in the effective
address identify the byte within the page being accessed. In order for the processor to access
the appropriate page in main memory, it must look in the page table maintained by the
operating system. The page table maps virtual pages to real pages (physical blocks of 4KB). The
processor looks for the virtual segment ID and page from the virtual address in the page table
and uses the corresponding real page number as the high-order 20 bits of the physical address.
The entire process involves going from a 32-bit effective address to a 52-bit virtual address,
and finally to a 32-bit physical address (see Figure 2.3).


FIGURE 2.3. 32-bit address translation.


The PowerPC architecture also supports a second way to map virtual memory to physical
memory: block address translation. Although pages are always 4KB, blocks can range from
128KB to 256MB.
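

The page-based translation described above (not block address translation) can be sketched in a few lines of Python. This is only a model of the bit-field arithmetic: the names `segment_registers` and `page_table` are hypothetical, and the page table is simplified to a dictionary keyed by the 40-bit virtual page number, which is an assumption for illustration and not how the hardware or an operating system actually organizes it.

```python
def effective_to_virtual(ea, segment_registers):
    """Form the 52-bit virtual address from a 32-bit effective address."""
    sr_index = (ea >> 28) & 0xF            # high-order 4 bits pick a segment register
    page_index = (ea >> 12) & 0xFFFF       # next 16 bits: page within the segment
    offset = ea & 0xFFF                    # last 12 bits: byte within the page
    vsid = segment_registers[sr_index] & 0xFFFFFF   # 24-bit virtual segment ID
    return (vsid << 28) | (page_index << 12) | offset

def virtual_to_physical(va, page_table):
    """Map a 52-bit virtual address to a 32-bit physical address."""
    virtual_page = va >> 12                # virtual segment ID and page index
    real_page = page_table[virtual_page]   # a missing entry models a page fault
    return ((real_page & 0xFFFFF) << 12) | (va & 0xFFF)
```

With segment register 3 holding the virtual segment ID 0xABCDEF, the effective address 0x30012345 becomes the virtual address 0xABCDEF0012345. If the page table maps virtual page 0xABCDEF0012 to real page 0x00042, the physical address is 0x00042345.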


Addressing Modes 


Another difference between RISC and CISC architectures is that RISC architectures generally 
support fewer addressing modes. The X86 architecture includes four major addressing modes: 
register addressing, immediate addressing, direct addressing, and indirect addressing. There are three 
types of indirect addressing: register indirect addressing, based indirect addressing with displace- 
ment, and based indirect addressing with index and displacement. The PowerPC architecture 
defines instructions that use all but the direct addressing mode and the based indirect address- 
ing with index and displacement, although PowerPC does have based indirect addressing with 
index. Another difference between PowerPC and X86 or 68000 addressing modes is that load 
and store instructions are the only instructions that use addressing modes other than register 
addressing or immediate addressing. With both the 68000 and the X86 architectures, the di- 
rect and indirect addressing modes can be used with many of the instructions other than simple 
data move instructions. 


With register addressing mode, the address used identifies some register in the microprocessor 
(see Table 2.2); for PowerPC architecture, this register can be any of the registers described in 
Table 2.1, although any given instruction can access only certain registers. The integer add 
instruction, for example, can use only integer general-purpose registers, but the move from 
special-purpose register instruction can use any special-purpose register as a source, and any in- 
teger general-purpose register as a target. In the PowerPC architecture, most instructions can 
use register addressing. 


Table 2.2. Examples of register addressing for x86, 68000, and PowerPC architectures.


x86              68000            PowerPC

MOV AX,BX        MOVE D1,D2       or r1,r2,r2
AX ← BX          D2 ← D1          r1 ← r2|r2

ADD AX,BX        ADD D1,D2        add r1,r2,r3
AX ← AX+BX       D2 ← D1+D2       r1 ← r2+r3

SUB AX,BX        SUB D1,D2        subf r2,r2,r3
AX ← AX-BX       D2 ← D2-D1       r2 ← r3-r2





With immediate addressing mode, a constant is used as an operand for the operation (see
Table 2.3).


Table 2.3. Examples of immediate addressing for x86, 68000, and PowerPC architectures.


x86                   68000                     PowerPC

MOV AX,5              MOVEQ #5,D1               addi r1,r0,5
AX ← 5                D1 ← 5                    r1 ← 0+5

MOV CX,0x01234567     MOVE.L #0x01234567,D1     addis r1,r0,0x0123
CX ← 0x01234567       D1 ← 0x01234567           r1 ← 0+0x01230000
                                                addi r1,r1,0x4567
                                                r1 ← (r1)+0x00004567

ADD AX,5              ADDI #5,D2                addi r2,r2,5
AX ← (AX)+5           D2 ← (D2)+5               r2 ← (r2)+5

SUB AX,5              SUBI #5,D2                subic r2,r2,5
AX ← (AX)-5           D2 ← (D2)-5               r2 ← (r2)-5

Indirect addressing modes use the contents of a register to address memory. Variations on this 
include using the contents of a register with some constant displacement and using the con- 
tents of a register with another register as an index (see Table 2.4). 


Table 2.4. Examples of indirect addressing for x86, 68000, and PowerPC architectures.


x86                     68000                     PowerPC

MOV AX,[BX]             MOVE (A1),D1              lwzx r1,r0,r2
AX ← [(BX)]             D1 ← [(A1)]               r1 ← [(r2)]

MOV CX,[BX+1]           MOVE (1,A1),D1            lwz r1,1(r2)
CX ← [(BX)+1]           D1 ← [(A1)+1]             r1 ← [(r2)+1]

ADD AX,[BX+5]           ADD (5,A1),D1             lwz r1,5(r2)
AX ← (AX)+[(BX)+5]      D1 ← (D1)+[(A1)+5]        r1 ← [(r2)+5]
                                                  add r3,r3,r1
                                                  r3 ← (r3)+(r1)

ADD AX,[BX+SI+4]        ADD (4,A1,A2.32*1),D1     addi r2,r1,4
AX ← (AX)+              D1 ← (D1)+                r2 ← (r1)+4
  [(BX)+(SI)+4]           [(A1)+(A2)+4]           lwzx r3,r2,r4
                                                  r3 ← [(r2)+(r4)]
                                                  add r5,r5,r3
                                                  r5 ← (r5)+(r3)


Processing Modes (User Versus Supervisor)


The PowerPC architecture supports two processing modes: user mode and supervisor mode. This 
architecture enables operating systems to limit the access of machine-critical resources in order 
to protect the system from wayward user programs. In user mode, certain processor resources 
cannot be accessed or changed. In addition, memory pages and blocks have two different sets 
of access permissions: one for user mode and one for supervisor mode. Thus, the operating 
system can set up memory pages that can be read but not updated by user code, or it even can 
set up memory pages that cannot be accessed at all by user programs while still maintaining 
read and/or write permission itself. 


Integer Processor Architecture 


An architecture consists of a framework of instructions and storage elements on which the in- 
structions operate in some consistent manner. This framework is a mathematical model that 
can be used to solve problems. In order to use this model, you must understand it in addition 
to understanding the instructions and storage elements that make up the framework. Most of 
this book is concerned with the framework, but here the underlying mathematical model is 
introduced, starting with the integer processor architectural model. 


Integer Number Representations 


The signed integer representation used in the PowerPC architecture is called two's complement
representation. Using 32 bits, two’s complement representation can represent integers between
-2,147,483,648 and +2,147,483,647.


Zero has exactly one representation: 0x00000000. The positive integers are represented by their
standard hexadecimal number (1 = 0x00000001, 12 = 0x0000000C, and the largest
representable positive integer, 2,147,483,647 = 0x7FFFFFFF).


The hexadecimal number that is one greater than the largest positive integer, 0x80000000,
represents the largest (magnitude) negative number, -2,147,483,648. From the largest negative
number upward, each time 1 is added to the hexadecimal representation, 1 is added to the
number being represented (0x80000001 = -2,147,483,647, and 0xFFFFFFFF = -1).


In order to negate a two’s complement number, the hexadecimal representation is inverted
logically and then incremented by 1. So starting with 1, 0x00000001, you invert to get
0xFFFFFFFE, and increment to get 0xFFFFFFFF = -1.
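

The representation and the negation rule can be checked with a short Python model. The helper names below are hypothetical and exist only for this illustration; Python integers are unbounded, so the 32-bit width is imposed with an explicit mask.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(bits):
    """Interpret a 32-bit pattern as a two's complement integer."""
    bits &= MASK32
    return bits - (1 << 32) if bits & 0x80000000 else bits

def negate32(bits):
    """Negate a 32-bit pattern: invert every bit, then add 1."""
    return ((bits ^ MASK32) + 1) & MASK32
```

Here `to_signed32(0xFFFFFFFF)` evaluates to -1 and `negate32(0x00000001)` evaluates to 0xFFFFFFFF, matching the worked example.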


Notice, however, that if you start with -2,147,483,648 = 0x80000000, you invert to get
0x7FFFFFFF, and add 1 to get 0x80000000 = -2,147,483,648. What happened? There is no
32-bit two’s complement representation of +2,147,483,648; overflow occurred.


Floating-Point Processor Architecture


One of the biggest differences between the PowerPC architecture and Intel’s X86 and Motorola’s
68000 architectures is in the integrated floating-point unit. Often, scientists and engineers
represent numbers in floating-point form; for example:


$1.2375 \times 10^{2} = 1.2375\text{e}2 = 123.75$


Floating-point numbers contain three parts: a base, a mantissa, and an exponent. In the earlier
example, 1.2375 is the mantissa and 2 is the exponent. The base of the number is 10; that is
the number raised to the exponent power. In a computer, numbers usually are represented in
base 2, and floating-point numbers are not an exception. The mantissa and exponent are
represented in base 2 and the base is 2. For example, the following is a floating-point number in
base 2 (the exponents are written in decimal):


$1.11101111_2 \times 2^{6} = 1111011.11_2 = 111.101111_2 \times 2^{4} = 123.75$


Notice that just as binary digits increase by a factor of 2 for each step to the left of the binary
point, they also decrease by a factor of 2 for each step to the right of the binary point. Thus,
0.11 in binary is 1/2 + 1/4 = 3/4. In the preceding example, the same number had two different
floating-point representations. Actually, every number has an infinite number of
representations. One special representation is when there is a single, non-zero digit to the left of the
binary point in the mantissa. Numbers in this form are called normalized numbers.
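

As a quick check of the normalized form, the following Python sketch splits a positive number into a mantissa between 1 and 2 and a power of 2. The helper name `normalize` is hypothetical. It relies on the standard library function `math.frexp`, which returns a mantissa between 0.5 and 1, so the result is adjusted by one binary place.

```python
import math

def normalize(x):
    """Return (mantissa, exponent) with 1 <= mantissa < 2 and
    x == mantissa * 2**exponent, for a positive number x."""
    m, e = math.frexp(x)      # x == m * 2**e with 0.5 <= m < 1
    return m * 2, e - 1
```

For 123.75 this gives a mantissa of 1.93359375, which is 1.11101111 in binary, and an exponent of 6.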


Execution Model 


The PowerPC floating-point execution model is based on the IEEE 754-1985 floating-point
standard. It supports both single and double precision operations. Single precision floating-point
numbers consist of a 24-bit unsigned mantissa, a sign bit, and an 8-bit signed exponent. The
leading digit of the mantissa is assumed to be a 1 (the number is assumed to be normalized),
which enables the 33-bit number to be stored in 32 bits (the leading bit of the mantissa is not
stored). Single precision exponents range from -126 to +127. Single precision mantissas can
range from 1 to $2 - 2^{-23}$. Finally, the sign bit specifies whether the number is positive or negative.


Double precision floating-point numbers consist of a 53-bit unsigned mantissa, a sign bit, and
an 11-bit signed exponent. Again, the numbers are assumed to be normalized, thereby
allowing the 53-bit mantissa to be stored in 52 bits, and the entire double precision floating-point
number to be stored in 64 bits. Double precision exponents range from -1022 to +1023. Double
precision mantissas can range from 1 to $2 - 2^{-52}$. Finally, the sign bit specifies whether the
number is positive or negative. With the numbers as defined here, the range of values for single
precision floating-point numbers is from $-2^{128} + 2^{104}$ to $2^{128} - 2^{104}$, or roughly
$-3.4 \times 10^{38}$ to $3.4 \times 10^{38}$. Double precision numbers can range from
$-2^{1024} + 2^{971}$ to $2^{1024} - 2^{971}$, or roughly $-1.8 \times 10^{308}$ to $1.8 \times 10^{308}$.


Notice that with this straightforward encoding, there is no way to represent the number 0.
Because of the implied 1 to the left of the binary point, the closest number to 0 that can be
represented is a fraction of 0 (meaning a mantissa of 1.0000...00) with the minimum exponent. In
order to remedy this problem, some special encodings or special numbers have been designated.
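

The field layout and the range limits can be verified with a short Python sketch. The helper names are hypothetical. One detail comes from the IEEE 754 standard rather than from the description above: the exponent field is stored with a bias (127 for single precision), which the sketch removes. The decoder assumes a normalized number and does not handle the special encodings.

```python
import struct

def single_bits(x):
    """Return the 32-bit IEEE single precision pattern of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def decode_single(bits):
    """Split a normalized single precision pattern into
    (sign, unbiased exponent, mantissa including the implied leading 1)."""
    sign = bits >> 31
    exponent = ((bits >> 23) & 0xFF) - 127     # remove the IEEE 754 bias
    mantissa = 1 + (bits & 0x7FFFFF) / 2**23   # restore the implied 1
    return sign, exponent, mantissa

# Largest finite values: maximum mantissa times 2 to the maximum exponent.
LARGEST_SINGLE = (2 - 2**-23) * 2.0**127
LARGEST_DOUBLE = (2 - 2**-52) * 2.0**1023
```

Decoding the pattern for 123.75 yields a sign of 0, an exponent of 6, and a mantissa of 1.93359375. The two constants equal $2^{128} - 2^{104}$ and $2^{1024} - 2^{971}$ respectively.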


Special Numbers 


Certain representations have special meanings in the PowerPC architecture. Zero, for example,
is represented by a mantissa of 0 with an exponent of $e_{min}$ ($e_{min} = -1023$ for double precision,
$-127$ for single precision). The sign bit is ignored for 0 values (+0 = -0). Zero really is a special
case of a broader set of special numbers called denormalized numbers. Denormalized numbers
have a 0 to the left of the binary point; they are not normalized and do not have an implicit
1 for the most significant bit of the mantissa. Any number with $e_{min}$ for an exponent value is a
denormalized number. Sometimes, support for denormalized numbers is referred to as gradual
underflow.


Infinity is another special number. For many applications, it is better for a very large number
to get clamped at infinity rather than to wrap around to a very small number as integers do
(maximum positive integer + 1 = maximum negative integer for signed integer arithmetic, or 0
for unsigned integer arithmetic). Instead, PowerPC floating-point numbers go to infinity.
Infinity is represented by a fraction of 0 with the maximum exponent value $e_{max}$ ($e_{max} = +1024$
for double precision, $+128$ for single precision). The sign bit differentiates positive and
negative infinity.


The final type of special number is called not a number (NaN). These numbers are used to
represent certain exceptions or the results of invalid operations. These representations come in
two forms: signaling NaNs and quiet NaNs. Quiet NaNs are represented by a mantissa with
the most significant fraction bit set to 1 and an exponent of $e_{max}$, while signaling NaNs are
represented by a mantissa with the most significant fraction bit set to 0, a nonzero value in
the remaining fraction bits, and an exponent of $e_{max}$.


NaNs are discussed in Appendix D, “A Detailed Floating-Point Model.”
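

The special encodings can be summarized as a small classifier over single precision bit patterns. This Python sketch uses a hypothetical function name and works on the stored exponent field, where all zeros corresponds to $e_{min}$ and all ones to $e_{max}$. It assumes the convention used by IEEE 754 implementations such as PowerPC, in which a set most significant fraction bit marks a quiet NaN.

```python
def classify_single(bits):
    """Classify a 32-bit single precision pattern by its special encoding."""
    exponent = (bits >> 23) & 0xFF    # stored exponent field
    fraction = bits & 0x7FFFFF        # 23 stored fraction bits
    if exponent == 0:                 # minimum exponent: zero or denormalized
        return "zero" if fraction == 0 else "denormalized"
    if exponent == 0xFF:              # maximum exponent: infinity or NaN
        if fraction == 0:
            return "infinity"
        return "quiet NaN" if fraction & 0x400000 else "signaling NaN"
    return "normalized"
```

The sign bit plays no part in the classification, so 0x7F800000 and 0xFF800000 are both reported as infinity, and 0x00000000 and 0x80000000 are both reported as zero.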


Rounding 


When an operation is performed on floating-point numbers, the result may have greater pre- 
cision (more digits) than either of the initial operands. For example: 


$1.2 \times 10^{0} \times 1.2 \times 10^{1} = 1.44 \times 10^{1}$


In this example, each of the operands has a single digit after the decimal place, but the result
has two digits after the decimal place. Remember, however, that the floating-point
representations in the PowerPC architecture have a fixed number of positions after the binary point. This
means that after a calculation is performed, the result is rounded to fit into the mantissa of the
appropriate representation.


Four rounding modes are supported by the PowerPC architecture (see Table 2.5). When a
number does not fit into the target register, there are two representable numbers that are
closest to it. The different rounding modes enable the programmer to specify which of these two
numbers is chosen as the final (rounded) result.


Table 2.5. Floating-point rounding modes supported by the PowerPC architecture.


Rounding Mode            Description

Round to nearest         Choose the closest representable value. If the
                         number is exactly between the two closest values,
                         choose the even one.

Round toward zero        Of the two closest representable values, choose
                         the one that is closer to zero.

Round toward positive    Of the two closest representable values, choose
infinity                 the larger one.

Round toward negative    Of the two closest representable values, choose
infinity                 the smaller one.
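

The four modes can be tried out on the decimal example used earlier, where a mantissa such as 1.44 must be rounded to one digit after the decimal point. The Python sketch below uses the standard `decimal` module as a stand-in; it works in base 10 rather than base 2, and the mode names and the function `round_to_one_place` are hypothetical labels chosen for this illustration.

```python
from decimal import (Decimal, ROUND_HALF_EVEN, ROUND_DOWN,
                     ROUND_CEILING, ROUND_FLOOR)

MODES = {
    "nearest": ROUND_HALF_EVEN,          # ties go to the even neighbor
    "toward zero": ROUND_DOWN,
    "toward +infinity": ROUND_CEILING,
    "toward -infinity": ROUND_FLOOR,
}

def round_to_one_place(value, mode):
    """Round a decimal string to one digit after the decimal point."""
    return Decimal(value).quantize(Decimal("0.1"), rounding=MODES[mode])
```

Rounding "1.44" gives 1.4 in every mode except round toward positive infinity, which gives 1.5. For "-1.44" the odd one out is round toward negative infinity, which gives -1.5.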

Registers 


The PowerPC architecture defines 32 floating-point registers. Each of these registers can hold 
one double precision floating-point number or one single precision floating-point number. In 
addition, there is a 32-bit floating-point status and control register that contains flags which 
tell the floating-point unit how to perform certain actions (for example, which rounding mode 
to use) and flags that can tell a programmer about exceptional events (when a divide-by-zero 
has occurred, for example). 


Branch Processor Architecture 


The branch processing portion, or unit, of the architecture describes a set of registers and
instructions that enable the programmer to specify how instructions are fetched from memory.
Three of the registers described earlier in the chapter are related to the branch
processor: the condition register, the link register, and the count register. The condition register
is used to direct branch instructions. A branch instruction may test a bit in the condition
register to determine whether the target instructions should be fetched, or whether fetching should
continue with the instruction immediately following the branch in memory (see Chapter 4,
“Branch and Control Flow Functions”). The link register is used for subroutine linkage (see
the following section), and the count register is used for loop control and for supplying branch
target addresses (see “Other Branch Structures”).


Subroutine Linkage


The link register is used to hold the return address for subroutine calls. A branch instruction
that is used to call a subroutine will update the link register with the address of the instruction
immediately following the branch in memory. This is the return address for the subroutine. If
the subroutine calls another subroutine, then the link register first must be copied into a
general integer register and then placed onto the software-maintained link stack, the stack of
addresses that corresponds to the function caller stack.


Other Branch Structures 


The count register is used for other branch structures. The first use is for loop structures. Branches 
that automatically decrement the count register and branch, based on a comparison of the count 
register contents to zero, can be used for loop constructs (for...next loops, for example). The 
count register first is loaded with the iteration variable. The branch instructions then can be 
coded to perform the loop. The second use for the count register is to supply a target address. 
This is used for long branches (branching farther in memory than can be coded using other 
branch instructions, for example), branches to function pointers, and calculated goto-type 
branches (switch statements in C, for example). 
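

The loop use of the count register can be modeled as follows. This Python sketch is an illustration only: the function name is hypothetical, the count register is an ordinary variable, and the loop is assumed to have its decrement-and-test branch at the bottom, so the body always runs at least once.

```python
def run_counted_loop(iterations, body):
    """Model a loop closed by a branch that decrements the count register
    and branches back while the count is not zero."""
    ctr = iterations & 0xFFFFFFFF        # load the count register
    executed = 0
    while True:
        body()                           # loop body
        executed += 1
        ctr = (ctr - 1) & 0xFFFFFFFF     # the branch decrements the count ...
        if ctr == 0:                     # ... and falls through at zero
            break
    return executed
```

Loading the count with 5 runs the body five times. Note that the test happens after the decrement, so in this model the count must be at least 1 when the loop is entered.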


Interrupt Architecture 


The PowerPC architecture defines several types of interrupts—each with its own address. The 
address of an interrupt is the memory location that the program jumps to when the interrupt 
occurs. The interrupt architecture is described in detail in Appendix C, “Operating System 
Design for PowerPC Processors.” Two interrupts of particular interest to the assembly-language 
programmer are discussed here: the system call interrupt and the program interrupt. These are 
the interrupts a programmer may use to perform some action. When an interrupt occurs, the 
processor goes into privileged mode, which enables code to access certain facilities with restricted 
access rights and to access data structures in memory that have restricted access rights (see 
Appendix C). Generally, these are data structures set up by the operating system, which are 
sensitive in some way (if they are updated inappropriately, for example, the machine may crash). 
If a programmer wants to access some operating system facility, then he can use a system call 
instruction (see Chapter 4) to cause a system call interrupt. Based on the contents of certain 
registers, the operating system will perform some service—possibly placing some data in the 
registers—and then return control back to the program that issued the system call instruction. 
In essence, the system call instruction is a subroutine call to an operating system subroutine. 


The other interrupt of particular interest to the assembly programmer is the program interrupt.
This interrupt can be caused by a trap instruction (see Chapter 3). Typically, trap instructions
check for some error condition and then trap (cause a program interrupt) if such an error
occurs. From that point, the operating system may try to fix the error or perhaps simply abort the
program.


Notation


In the following chapters, the notations shown in Table 2.6 are used. Throughout the book,
bits are numbered starting with the most significant bit as 0, and ending with the least
significant bit as the operand size minus one. Thus, for a 32-bit number, bit 0 is the most significant
bit, and bit 31 is the least significant. This is the convention used for PowerPC bit numbering
and may be the reverse of what you are used to seeing for other architectures.
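

Because this bit numbering is easy to get backward, a small model may help. The Python helpers below are hypothetical names used only for illustration; they translate the book's numbering (bit 0 is the most significant bit) into the shift amounts used by most programming languages.

```python
def bit(value, n, width=32):
    """Return bit n of value, where bit 0 is the most significant bit."""
    return (value >> (width - 1 - n)) & 1

def subfield(value, first, last, width=32):
    """Return bits first through last (inclusive) of value,
    numbered from the most significant bit."""
    length = last - first + 1
    return (value >> (width - 1 - last)) & ((1 << length) - 1)
```

For the 32-bit value 0x12345678, bits 0 through 3 are 0x1 and bits 28 through 31 are 0x8.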


Table 2.6. Notations used in this book.


Notation      Description

(Rx)          Register reference. This means the contents of Rx. Thus, (R2)
              means the data contained in register R2.

[x]           Memory reference. This means the memory location addressed
              by the value x.

←             Assignment statements. The object on the left of the
              assignment symbol is given the value of the object on the right
              side of the symbol.

<<            Shift left. A<<B means that A is shifted left by B bits.

>>            Shift right. A>>B means that A is shifted right by B bits.

<             Signed less than. This means less than and uses a signed
              comparison. Thus, 0xFFFF < 0x0000 (-1 < 0).

u<            Unsigned less than. This means less than and uses an unsigned
              comparison. Thus, 0x0000 u< 0xFFFF (0 u< 65,535).

>             Signed greater than. This means greater than and uses a signed
              comparison. Thus, 0x0000 > 0xFFFF (0 > -1).

u>            Unsigned greater than. This means greater than and uses an
              unsigned comparison. Thus, 0xFFFF u> 0x0000 (65,535 u> 0).

==            Equal to. Thus, 0x0000 == 0x0000 (0 == 0).

!=            Not equal to. Thus, 0x0000 != 0xFFFF (0 != -1).

||            Concatenate. Thus, 0xF || 0xD means 0xFD.

&             Boolean AND. 0xF55F & 0x5FAF = 0x550F.

|             Boolean OR. 0xF55F | 0x5FAF = 0xFFFF.

¬             Boolean NOT. ¬A means not A. So ¬A | A = 0b1 and
              ¬A & A = 0b0.

⊕             Boolean XOR. Thus, ¬A ⊕ A = 0b1 and A ⊕ A = 0b0.

≡             Boolean equivalence. Thus, ¬A ≡ A = 0b0 and A ≡ A = 0b1.

sign_ext()    Sign extend. sign_ext(0xFF) = 0xFFFF, and sign_ext(0x7F) =
              0x007F.

$Rn_{0-8}$    Subfield. Subscripts used in this fashion mean a subfield of the
              contents of Rn. Thus, $Rn_{0-8}$ means bits 0 through 8 of the
              register Rn.

$^{n}A$       Repeat. This is a repeat symbol. $^{n}A$ means to concatenate n
              copies of A.

+             Addition. 0x0001 + 0xF001 = 0xF002.

-             Subtraction. 0x0444 - 0x0434 = 0x0010.

*             Multiplication. 0x0002 * 0x0004 = 0x0008.

%             Modulo. This means modulo or remainder. A%B means the
              remainder if you divide A by B.

/             Divide. A/B means A divided by B.

?:            Conditional operator. An if...then...else structure. The
              statement A?B:C takes the value of B if A evaluates to TRUE;
              otherwise, it takes the value of C.

(Rn|0)        Contents of Rn if n is 1-31, 0 if n is 0.


Instruction Descriptions





In this part we describe the PowerPC instruction set and give some programming ex- 
amples to show how the instructions are used. The PowerPC instruction set can be split 
into 3 main areas: integer processing instructions, control flow processing instructions, 
and floating point processing instructions. We will discuss each of these types of instruc- 
tions below. A fourth area involves processor and system control instructions. These 
instructions are generally used by an operating system and are not available to user (ap- 
plication) code (see Appendix C, “Operating System Design for PowerPC Processors”). 


The PowerPC architecture defines extended mnemonics, in addition to the actual hard- 
ware-implemented instructions. Extended mnemonics are translated by the assembler into 
hardware instructions and are only meaningful to the programmer. For instance, the 
subtract (sub) instruction is an extended mnemonic which is translated into the subtract 
from (subf) instruction (these instructions are described in more detail below). The subf 
instruction subtracts the first operand from the second operand. The sub instruction 
subtracts the second operand from the first operand. When an assembler sees a sub in- 
struction, it simply swaps the first and second operands and changes the sub instruction 
to a subf instruction. The extended mnemonics can be used to make code easier to read. 
Extended mnemonics are used throughout this book but are typically noted as extended. 
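

The translation of the sub extended mnemonic can be sketched as a tiny text rewrite. This Python function is a hypothetical illustration of what an assembler does for this one mnemonic, not the code of any real assembler; it assumes the three operands are written as a comma-separated list after the mnemonic.

```python
def expand_extended(line):
    """Rewrite 'sub RT,RA,RB' as 'subf RT,RB,RA'; leave other lines alone."""
    mnemonic, operands = line.split(None, 1)
    if mnemonic != "sub":
        return line
    rt, ra, rb = [op.strip() for op in operands.split(",")]
    # sub computes RA - RB; subf computes (second source) - (first source),
    # so the two source operands trade places and the target is unchanged.
    return f"subf {rt},{rb},{ra}"
```

For example, `expand_extended("sub r3,r4,r5")` returns `"subf r3,r5,r4"`; both forms place r4 minus r5 into r3.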


Note that we follow a convention where RT always denotes a target register, while RS, 
RA, and RB always denote source registers. The processor users’ guides and architecture 
definition use a different (and more complicated) convention for naming the source and 
destination systems; we chose a simpler scheme that should be more intuitive for 
programmer's purposes. 


Integer Instructions


The integer instructions can be split into four subsections: the arithmetic instructions, the
logical operation instructions, the rotate and shift instructions, and the load/store instructions. All
integer instructions operate on data contained in the integer general-purpose registers. In
addition, the condition register can be updated by integer instructions, and there is a special
register that is used by integer instructions called the fixed point exception register or the XER.


Arithmetic Instructions 


The arithmetic instructions perform the standard integer arithmetic operations—including 
addition, subtraction, multiplication, and division. The addressing modes supported by these 
instructions are immediate addressing and register addressing. All these instructions read their 
sources and store their results into integer general-purpose registers. 


For some of these instructions, it is possible to update the condition register with information 
about the result of the instruction or the XER with information about the execution of the 
instruction. These registers are known as implicit targets because the registers are not explicitly 
specified in the instruction; instead, a suffix is added to the mnemonic to indicate to the as- 
sembler that a slightly different form of the instruction should be used. 


To indicate that the condition register should be updated with information about the result, a
dot [.] is added to the instruction mnemonic. In this case, the processor updates condition register
field zero with a set of flags describing the result. The four bits of the condition register field
are set as follows: bit 0 is set if the result is negative, bit 1 is set if the result is positive, and
bit 2 is set if the result is zero (thus exactly one of the first three bits is set). Bit 3 is set to a
copy of the summary overflow bit in the XER, which indicates that an overflow condition has
been recorded.
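

The flag settings for condition register field zero can be modeled in Python as shown below. The function name is hypothetical, and the 4-bit field is returned as an integer whose most significant bit is bit 0 of the field. The fourth bit is modeled as a copy of the XER summary overflow bit supplied by the caller, following the PowerPC architecture definition.

```python
def cr0_field(result, summary_overflow=0):
    """Return the 4-bit condition register field for a 32-bit result."""
    value = result & 0xFFFFFFFF
    signed = value - (1 << 32) if value & 0x80000000 else value
    negative = 1 if signed < 0 else 0    # field bit 0
    positive = 1 if signed > 0 else 0    # field bit 1
    zero = 1 if signed == 0 else 0       # field bit 2
    return (negative << 3) | (positive << 2) | (zero << 1) | (summary_overflow & 1)
```

A result of 0xFFFFFFFF (-1) produces the field 0b1000, a result of 5 produces 0b0100, and a result of 0 produces 0b0010.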


To indicate that the XER should be updated with information about the execution of the in- 
struction, an o [o] should be added to the mnemonic. In this case, two bits in the XER may be 
updated: the overflow bit and the summary overflow bit. The overflow bit is set to 1 if the 
instruction experienced an overflow condition during execution and to 0 if the instruction 
completed without overflow. The summary overflow bit is set to 1 if the instruction experi- 
enced an overflow condition during execution, and is not changed if it did not experience an 
overflow condition. 
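

The difference between the overflow bit and the summary overflow bit is that the latter is sticky. The Python sketch below models this for a 32-bit add with overflow recording; the class and function names are hypothetical, and only the two overflow bits of the XER are represented.

```python
MASK32 = 0xFFFFFFFF

class OverflowBits:
    """Minimal model of the two overflow bits in the XER."""
    def __init__(self):
        self.ov = 0   # overflow: rewritten by every recording instruction
        self.so = 0   # summary overflow: set on overflow, never cleared here

def add_recording_overflow(a, b, xer):
    """Add two 32-bit values and record signed overflow in xer."""
    result = (a + b) & MASK32
    # Signed overflow: both operands have the same sign and the result's
    # sign differs from it.
    overflow = (((a ^ result) & (b ^ result)) >> 31) & 1
    xer.ov = overflow
    xer.so |= overflow
    return result
```

After adding 0x7FFFFFFF and 1, both bits are 1. A following add of 1 and 1 clears the overflow bit but leaves the summary overflow bit set, so a program can test once after a sequence of operations.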


Integer Add Instructions 


The PowerPC architecture includes eight integer add instructions. All instructions have two
operands and store their primary result into general-purpose integer registers. The general form
of an add instruction is:


add RT, RA, source1


The contents of RA are added to source1 and the result is placed into RT. source1 can be the
value of an immediate field (immediate addressing mode), the contents of a general-purpose
integer register RB (register addressing mode), or implied by the specific instruction used.


Three of the add instructions get source from the immediate field. The first two, add immediate
(addi) and add immediate shifted (addis), can be used together to generate a 32-bit constant
in a register. The addi instruction adds RA to the immediate field sign extended to 32
bits, and the addis instruction adds RA to the immediate field shifted left by 16 bits. These two
instructions follow the conventions of the load/store address-generation instructions: If RA is
register 0, then the immediate field is added to 0 rather than to the contents of R0. The
next add instruction, add immediate carrying with and without condition register record
(addic[.]), performs the add of RA and the immediate value sign extended to 32 bits, and
places the carry out of the add into the carry bit in the XER.
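Because addi sign extends its immediate field, building a 32-bit constant from two 16-bit halves needs one adjustment: when the low half has its top bit set, the high half must be increased by one to compensate. The following Python sketch models this on a 32-bit machine. The helper names (sign_ext16, load_constant) are hypothetical and exist only for this illustration; both addi and addis are called with a base value of 0, which corresponds to coding RA as 0.

```python
MASK32 = 0xFFFFFFFF

def sign_ext16(si):
    """Sign extend a 16-bit immediate field."""
    si &= 0xFFFF
    return si - 0x10000 if si & 0x8000 else si

def addi(base, si):
    """Model of addi: base plus the sign-extended immediate."""
    return (base + sign_ext16(si)) & MASK32

def addis(base, si):
    """Model of addis: base plus the immediate shifted left by 16 bits."""
    return (base + (sign_ext16(si) << 16)) & MASK32

def load_constant(value):
    """Build a 32-bit constant with an addis followed by an addi."""
    low = value & 0xFFFF
    high = (value >> 16) & 0xFFFF
    if low & 0x8000:
        # addi will effectively subtract 0x10000, so pre-add 1 to the high half.
        high = (high + 1) & 0xFFFF
    r = addis(0, high)   # addis RT, 0, high
    return addi(r, low)  # addi  RT, RT, low

print(hex(load_constant(0x12348765)))
```

Without the adjustment, the example constant would come out 0x10000 too small, because its low half (0x8765) is treated as a negative number by addi.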


Three add instructions use the contents of RB for source. Each of these three instructions can
update the condition register or not, and update the XER with overflow information or not.
The basic add instruction (add[o][.]) simply adds the contents of RA to the contents of RB
and puts the result in RT. If the carry out of the add is needed, then the add carrying instruction
(addc[o][.]) should be used. The add carrying instruction places the carry out from the
addition into the carry bit in the XER. The remaining add instruction that uses RB, the add
extended (adde[o][.]) instruction, adds RA and RB to the contents of the carry bit in the XER
and places the result into RT.
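The carrying and extended forms are typically used as a pair to add numbers wider than one register. As a rough sketch, the Python model below adds two 64-bit values held as 32-bit halves; the function names mirror the mnemonics, but the tuple return value (result, carry) and the add64 helper are assumptions made for the illustration.

```python
MASK32 = 0xFFFFFFFF

def addc(ra, rb):
    """Model of add carrying: returns (result, carry out)."""
    total = ra + rb
    return total & MASK32, total >> 32

def adde(ra, rb, ca):
    """Model of add extended: adds the incoming carry bit as well."""
    total = ra + rb + ca
    return total & MASK32, total >> 32

def add64(a_hi, a_lo, b_hi, b_lo):
    """Add two 64-bit numbers held in register pairs."""
    lo, ca = addc(a_lo, b_lo)      # addc lo, a_lo, b_lo
    hi, ca = adde(a_hi, b_hi, ca)  # adde hi, a_hi, b_hi
    return hi, lo

hi, lo = add64(0x00000001, 0xFFFFFFFF, 0x00000000, 0x00000001)
print(hex(hi), hex(lo))   # the carry out of the low words ripples into the high word
```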
The last two add instructions are special cases of the add extended instruction. The first is the
add to minus one extended instruction (addme[o][.]), which adds RA to -1 and the carry bit in
the XER. The second is the add to zero extended instruction (addze[o][.]), which adds RA to 0
and the carry bit in the XER. Table 3.1 lists the add instructions.
Table 3.1. Add instructions.

Instruction                    Definition                      Format
add immediate                  RT <- (RA)|0 + sign_ext(SI)     addi RT,RA,SI
add immediate shifted          RT <- (RA)|0 + (SI << 16)       addis RT,RA,SI
add immediate carrying         RT <- (RA) + sign_ext(SI)       addic[.] RT,RA,SI
                               XER_CA <- carry_out
add                            RT <- (RA) + (RB)               add[o][.] RT,RA,RB
add carrying                   RT <- (RA) + (RB)               addc[o][.] RT,RA,RB
                               XER_CA <- carry_out
add extended                   RT <- (RA) + (RB) + XER_CA      adde[o][.] RT,RA,RB
                               XER_CA <- carry_out
add to minus one extended      RT <- (RA) + XER_CA - 1         addme[o][.] RT,RA
                               XER_CA <- carry_out
add to zero extended           RT <- (RA) + XER_CA             addze[o][.] RT,RA
                               XER_CA <- carry_out

Note: If [o] is added, XER_OV and XER_SO will be set; if [.] is added, CR0 will be altered based on the
result of the instruction.


Integer Subtract Instructions 


The PowerPC architecture includes 12 integer subtract instructions. All of these instructions have two
operands and store their primary result into a general-purpose integer register. One of the two
operands comes from a general-purpose integer register; the other operand can come from an
immediate field, come from a general-purpose integer register, or be implied by the
instruction. The general form of the subtract instructions follows:


sub RT, RA, source 


RA is subtracted from source or source is subtracted from RA, depending on the instruction,
and the result is placed into RT. source is the immediate field for subtract instructions
using the immediate addressing mode, a register (RB) for subtract instructions using the
register addressing mode, and an implicit value for certain instructions.


Four subtract instructions use the immediate addressing mode, and a fifth takes a single source
register. The first two instructions, subtract
immediate (subi) and subtract immediate shifted (subis), are extended mnemonics of the add
immediate (addi) and add immediate shifted (addis) instructions described earlier. The only
difference between these instructions and the addi/addis instructions is that RA is added to the
2's complement of the sign extended immediate value (possibly shifted left by 16 bits). The
next immediate mode subtract instruction, subtract immediate carrying with and without condition
register update (subic[.]), is an extended mnemonic of the add immediate carrying instruction
described earlier. Again, the immediate field is negated (2's complement) and then
added to RA. The subtract from immediate carrying instruction (subfic) negates RA (2's complement)
before performing the addition of RA and the immediate field; thus, RA is subtracted
from the immediate field and the result is placed in RT. Finally, the single source subtract
instruction (neg[o][.]) negates the source register.


Seven subtract instructions use the register addressing mode. The subtract from instruction
(subf[o][.]) stores RB minus RA into RT. The subtract from carrying instruction (subfc[o][.])
subtracts RA from RB, places the result into RT, and sets the carry bit in the XER to the carry
out of the subtraction. The simple subtract instruction (sub[o][.]) and the subtract carrying
instruction (subc[o][.]) are extended mnemonics of the subf[o][.] and subfc[o][.] instructions,
respectively. Simple subtract and subtract carrying work similarly to subf[o][.] and subfc[o][.],
except that RA and RB are reversed so that RT is loaded with RA-RB instead of RB-RA. The
last three subtract instructions add the carry bit in the XER into the result. The subtract from
extended instruction (subfe[o][.]) stores RB minus RA into RT, with the carry bit in the XER acting
as a borrow from the subtractor (if the carry bit is a 1 then there is no borrow, while if the
carry bit is a 0 there is a borrow). The subtract from minus one extended instruction (subfme[o][.])
subtracts RA from -1 with the carry bit in the XER as a borrow from the subtractor. The subtract
from zero extended instruction (subfze[o][.]) subtracts RA from 0 with the carry bit in the
XER as a borrow from the subtractor. Table 3.2 lists the subtract instructions.
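The borrow convention is easiest to see in a multiword subtraction. The Python sketch below models subfc and subfe as "complement RA, add RB, add a carry in", which is one common way to describe a subtractor, and chains them to subtract 64-bit values held as 32-bit halves. The tuple return values and the sub64 helper are assumptions made for the illustration.

```python
MASK32 = 0xFFFFFFFF

def subfc(ra, rb):
    """Model of subtract from carrying: RB - RA, returns (result, carry out)."""
    total = (~ra & MASK32) + rb + 1
    return total & MASK32, total >> 32

def subfe(ra, rb, ca):
    """Model of subtract from extended: RB - RA - 1 + CA."""
    total = (~ra & MASK32) + rb + ca
    return total & MASK32, total >> 32

def sub64(a_hi, a_lo, b_hi, b_lo):
    """Compute a - b for 64-bit numbers held in register pairs."""
    lo, ca = subfc(b_lo, a_lo)      # subfc lo, b_lo, a_lo  (a_lo - b_lo)
    hi, ca = subfe(b_hi, a_hi, ca)  # subfe hi, b_hi, a_hi  (a_hi - b_hi - borrow)
    return hi, lo

print(subfc(1, 0))   # 0 - 1: carry out is 0, meaning a borrow occurred
print(subfc(3, 5))   # 5 - 3: carry out is 1, meaning no borrow
hi, lo = sub64(0x00000002, 0x00000000, 0x00000000, 0x00000001)
print(hex(hi), hex(lo))
```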


Table 3.2. Subtract instructions.

Instruction                          Definition                          Format
negate                               RT <- -(RA)                         neg[o][.] RT,RA
subtract immediate                   RT <- (RA)|0 - sign_ext(SI)         subi RT,RA,SI
subtract immediate shifted           RT <- (RA)|0 - (SI << 16)           subis RT,RA,SI
subtract immediate carrying          RT <- (RA) - sign_ext(SI)           subic[.] RT,RA,SI
                                     XER_CA <- carry_out
subtract from immediate carrying     RT <- sign_ext(SI) - (RA)           subfic RT,RA,SI
                                     XER_CA <- carry_out
subtract                             RT <- (RA) - (RB)                   sub[o][.] RT,RA,RB
subtract carrying                    RT <- (RA) - (RB)                   subc[o][.] RT,RA,RB
                                     XER_CA <- carry_out
subtract from                        RT <- (RB) - (RA)                   subf[o][.] RT,RA,RB
subtract from carrying               RT <- (RB) - (RA)                   subfc[o][.] RT,RA,RB
                                     XER_CA <- carry_out
subtract from extended               RT <- (RB) - (RA) - 1 + XER_CA      subfe[o][.] RT,RA,RB
                                     XER_CA <- carry_out
subtract from minus one extended     RT <- -(RA) + XER_CA - 2            subfme[o][.] RT,RA
                                     XER_CA <- carry_out
subtract from zero extended          RT <- -(RA) + XER_CA - 1            subfze[o][.] RT,RA
                                     XER_CA <- carry_out

Note: If [o] is added, XER_OV and XER_SO will be set; if [.] is added, CR0 will be altered based on the
result of the instruction.
Integer Multiply Instructions


The PowerPC architecture includes five integer multiply instructions. All these instructions
have two operands and store their primary result into a general-purpose integer register. One
of the two operands comes from a general-purpose integer register; the other operand can come
from an immediate field (immediate addressing mode), or from a general-purpose integer
register (register addressing mode). The general form of the multiply instructions follows:


mul RT, RA, source 


RA is multiplied by source, and the result is placed into RT. source is the immediate field
for multiply instructions using the immediate addressing mode and a register (RB) for
multiply instructions using the register addressing mode.
The PowerPC architecture defines multiply instructions to enable a programmer to multiply
two integers and get the full range of the results. Note that when two 32-bit numbers are
multiplied, the result may require as many as 64 bits to be represented. In order to accommodate
this requirement, the multiply instructions are split into two classes: those that give the
upper-order bits as the result, and those that give the lower-order bits as the result.


There are three multiply instructions defined for 32-bit implementations of the architecture.
The multiply low immediate instruction (mulli) multiplies RA by the sign extended immediate
field (SI). The low-order half of the 64-bit result is placed into the target register. The multiply
low word instruction (mullw[o][.]) uses the register addressing mode and multiplies RA by RB,
placing the low-order 32 bits of the result into RT. If the mullwo[.] form of the instruction is
used, then the overflow bit in the XER is set to 1 if the result cannot be represented in 32 bits. The
last 32-bit multiply instruction is the multiply high word instruction (mulhw[u][.]). This
instruction produces the upper-order half of the 64-bit result. RA is multiplied by RB and the
upper-order half of the 64-bit result is stored in RT. If the mulhwu[.] form of the instruction
is used, then RA and RB are treated as unsigned integers; if the mulhw[.] form of the
instruction is used, then RA and RB are treated as signed integers.
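The split between low-order and high-order results can be modeled in a few lines of Python. This is a sketch of the arithmetic only (no condition register or XER updates), and the helper to_signed32 is an assumption introduced for the illustration.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def mullw(ra, rb):
    """Low-order 32 bits of the product (the same for signed and unsigned)."""
    return (ra * rb) & MASK32

def mulhwu(ra, rb):
    """High-order 32 bits of the unsigned 64-bit product."""
    return ((ra * rb) >> 32) & MASK32

def mulhw(ra, rb):
    """High-order 32 bits of the signed 64-bit product."""
    return ((to_signed32(ra) * to_signed32(rb)) >> 32) & MASK32

ra, rb = 0xFFFFFFFF, 2
print(hex(mullw(ra, rb)))    # low word
print(hex(mulhwu(ra, rb)))   # unsigned view: 4294967295 * 2
print(hex(mulhw(ra, rb)))    # signed view: -1 * 2
```

The low word is identical in both interpretations, which is why only the high word instruction needs a signed and an unsigned form.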





Multiplying two unsigned 64-bit numbers on a 32-bit machine (producing a 128-bit result)

In this example, you multiply two 64-bit numbers. One operand is contained in R1
(high-order 32 bits) and R2 (low-order 32 bits), and the other operand is contained in
R3 (high-order 32 bits) and R4 (low-order 32 bits). Multiplying two 64-bit numbers
will produce a 128-bit result, which you will place in R5 (high-order 32 bits) through
R8 (low-order 32 bits). You will perform the multiplication in the same way that you
normally multiply multidigit numbers by hand; remember that multiplying two 32-bit
numbers produces a 64-bit result (see Figure 3.1). Table 3.3 lists the 32-bit multiply
instructions.
FIGURE 3.1. Unsigned multiplication of two 64-bit numbers on a 32-bit machine.
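The hand-multiplication scheme for two unsigned 64-bit numbers can be sketched in Python as four 32-bit partial products that are summed column by column. This is an illustrative model, not the book's listing: the function name mul64x64 and the inner helper are hypothetical, and the parameters follow the register assignment described for the example (R1:R2 times R3:R4, result in R5 through R8). Each partial product is split into the halves that mulhwu and mullw would deliver.

```python
MASK32 = 0xFFFFFFFF

def mul64x64(r1, r2, r3, r4):
    """Multiply R1:R2 by R3:R4 (unsigned); returns (R5, R6, R7, R8)."""
    def product(x, y):
        # High and low words, as mulhwu and mullw would produce them.
        p = x * y
        return p >> 32, p & MASK32

    p0_hi, p0_lo = product(r2, r4)   # low  x low
    p1_hi, p1_lo = product(r2, r3)   # low  x high
    p2_hi, p2_lo = product(r1, r4)   # high x low
    p3_hi, p3_lo = product(r1, r3)   # high x high

    r8 = p0_lo
    col1 = p0_hi + p1_lo + p2_lo                  # second column, may carry
    r7 = col1 & MASK32
    col2 = (col1 >> 32) + p1_hi + p2_hi + p3_lo   # third column plus carries
    r6 = col2 & MASK32
    r5 = ((col2 >> 32) + p3_hi) & MASK32          # top column plus carries
    return r5, r6, r7, r8

a, b = 0xFFFFFFFFFFFFFFFF, 0x123456789ABCDEF0
r5, r6, r7, r8 = mul64x64(a >> 32, a & MASK32, b >> 32, b & MASK32)
print(hex((r5 << 96) | (r6 << 64) | (r7 << 32) | r8))
```

On a real machine the column sums would be formed with the carrying and extended add instructions, since each column can overflow a single register.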
Table 3.3. 32-bit multiply instructions.

Instruction               Definition                                         Format
multiply low immediate    RT <- low-order 32 bits of (RA) * sign_ext(SI)     mulli RT,RA,SI
multiply low word         RT <- low-order 32 bits of (RA) * (RB)             mullw[o][.] RT,RA,RB
multiply high word        RT <- high-order 32 bits of (RA) * (RB)            mulhw[u][.] RT,RA,RB

Note: If [o] is added, XER_OV and XER_SO will be set; if [.] is added, CR0 will be altered based on the
result of the instruction.





Five multiply instructions are defined for 64-bit implementations of the PowerPC architecture.
The first three instructions are the instructions that were defined earlier for 32-bit
implementations. The multiply low immediate instruction (mulli) multiplies RA by the sign extended
immediate field (SI) and stores the low-order 64 bits of the 128-bit result in RT. The multiply
low word instruction (mullw[o][.]) multiplies the low-order 32 bits of RA by the low-order 32
bits of RB and stores the 64-bit result in RT. If the mullwo[.] form of the instruction is used,
then the overflow bit in the XER is set to 1 if the result cannot be represented in 32 bits. The
multiply high word instruction (mulhw[u][.]) multiplies the low-order 32 bits of RA by the
low-order 32 bits of RB and stores the high-order 32 bits of the result into the low-order 32 bits of
RT. The high-order 32 bits of RT are left undefined. If the mulhwu[.] form of the instruction
is used, then RA and RB are treated as unsigned; if the mulhw[.] form of the instruction is
used, then RA and RB are treated as signed.


The last two multiply instructions are defined only for 64-bit implementations of the architecture.
The first instruction, multiply low doubleword (mulld[o][.]), is very similar to the mullw[o][.]
instruction. The low-order half of the 128-bit product of RA and RB is stored in RT. If the
mulldo[.] form of the instruction is used, then the overflow bit in the XER is set to 1 if the
product cannot be represented in 64 bits. The multiply high doubleword instruction (mulhd[u][.])
is very similar to the mulhw[u][.] instruction defined earlier. The high-order half of the 128-bit
product of RA and RB is stored in RT. If the mulhdu[.] form of the instruction is used, then RA and RB
are treated as unsigned integers; if the mulhd[.] form of the instruction is used, then RA and
RB are treated as signed integers. Table 3.4 lists the 64-bit multiply instructions.


Table 3.4. 64-bit multiply instructions.

Instruction                 Definition                                              Format
multiply low immediate      RT <- low-order 64 bits of (RA) * sign_ext(SI)          mulli RT,RA,SI
multiply low word           low word of RT <- low-order 32 bits of                  mullw[o][.] RT,RA,RB
                              (low word of RA) * (low word of RB)
                            high word of RT <- undefined
multiply high word          low word of RT <- high-order 32 bits of                 mulhw[u][.] RT,RA,RB
                              (low word of RA) * (low word of RB)
                            high word of RT <- undefined
multiply low doubleword     RT <- low-order 64 bits of (RA) * (RB)                  mulld[o][.] RT,RA,RB
multiply high doubleword    RT <- high-order 64 bits of (RA) * (RB)                 mulhd[u][.] RT,RA,RB

Note: If [o] is added, XER_OV and XER_SO will be set; if [.] is added, CR0 will be altered based on the
result of the instruction.


Integer Divide Instructions 


The PowerPC architecture includes two integer divide instructions. Both instructions have two 
operands and store their primary result into general-purpose integer registers. Both of the op- 
erands come from general-purpose integer registers. The general form of the divide instruc- 
tions follows: 


div RT, RA, RB 


RA is divided by RB, and the result is placed into RT. Specifically, RT is set to a value so that
the following equation is satisfied:

RA = (RT × RB) + r, where 0 <= r < |RB| if RA is nonnegative, and -|RB| < r <= 0 if RA is negative.


The remainder is not specifically supplied as a result of any of the divide instructions in the 
PowerPC architecture (see Example 3.4). 
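Since no divide instruction returns the remainder, it has to be rebuilt from the quotient with a multiply and a subtract. The Python sketch below models that three-instruction sequence for signed 32-bit values. The helper names are assumptions for the illustration, the quotient is truncated toward zero so that the remainder takes the sign of the dividend, and the sketch does not model division by zero or the overflow case of the most negative number divided by -1.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def divw(ra, rb):
    """Signed divide, quotient truncated toward zero (rb must be nonzero)."""
    a, b = to_signed32(ra), to_signed32(rb)
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return q & MASK32

def mullw(ra, rb):
    return (ra * rb) & MASK32

def subf(ra, rb):
    """Subtract from: RB - RA."""
    return (rb - ra) & MASK32

def remainder(ra, rb):
    q = divw(ra, rb)     # divw  RT, RA, RB
    t = mullw(q, rb)     # mullw RT, RT, RB
    return subf(t, ra)   # subf  RT, RT, RA   -> RA - (quotient * RB)

print(to_signed32(remainder(7, 2)))
print(to_signed32(remainder(-7 & MASK32, 2)))   # remainder follows the dividend's sign
```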
One integer divide instruction is defined for 32-bit implementations of the PowerPC architecture:
the divide word instruction (divw[u][o][.]). If the divwu[o][.] form of the instruction is
used, then RA and RB are treated as unsigned integers; if the divw[o][.] form of the instruction
is used, then RA and RB are treated as signed integers. The division is performed as described
earlier. Table 3.5 lists the 32-bit divide instruction.


Table 3.5. 32-bit divide instruction.

Instruction     Definition             Format
divide word     RT <- (RA) / (RB)      divw[u][o][.] RT,RA,RB

Note: If [o] is added, XER_OV and XER_SO will be set; if [.] is added, CR0 will be altered based on the
result of the instruction.


Two integer divide instructions are defined for 64-bit implementations of the architecture. The
first instruction is the divw[u][o][.] instruction from earlier in this chapter. The dividend is
formed by sign extending the low-order 32 bits of RA, the divisor is formed by sign extending
the low-order 32 bits of RB, and the 32-bit quotient is formed by dividing the dividend by the
divisor. The quotient then is placed into the low-order 32 bits of RT. The high-order bits of RT are left
undefined. If the divwu[o][.] form of the instruction is used, then RA and RB are treated as
unsigned integers. If the divw[o][.] form of the instruction is used, then RA and RB are treated
as signed integers. The second 64-bit integer divide instruction is the divide doubleword
instruction (divd[u][o][.]). If the divdu[o][.] form of the instruction is used, then RA and RB
are treated as unsigned integers; if the divd[o][.] form of the instruction is used, then RA and
RB are treated as signed integers. The division is performed as described earlier. Table 3.6 lists
the 64-bit divide instructions.


Table 3.6. 64-bit divide instructions.

Instruction           Definition                                               Format
divide word           low word of RT <- (low word of RA) / (low word of RB)    divw[u][o][.] RT,RA,RB
                      high word of RT <- undefined
divide doubleword     RT <- (RA) / (RB)                                        divd[u][o][.] RT,RA,RB

Note: If [o] is added, XER_OV and XER_SO will be set; if [.] is added, CR0 will be altered based on the
result of the instruction.


Logical Operation Instructions 


The logical operations perform Boolean algebraic operations on the integer registers, and some 
other simple bit-manipulation operations. These instructions will execute on both 32- and 64- 
bit machines. Some of the logical operation instructions can set condition register field zero 
with a signed comparison of the result of the operation to 0. These instructions are identified 
by a dot [.] at the end of the mnemonic. None of the logical instructions updates the overflow 
bit, summary overflow bit, or carry bit in the XER. The logical operations that require two 
operands all use one integer register and either another integer register (register addressing mode) 
or an immediate field in the instruction (immediate addressing mode). The logical operations
that require only one operand get that operand from an integer register.


The immediate addressing mode instructions all have the following basic format: 
Operation RT, RA, UI 


Where RT is the target register, RA is the source register, and UI is a 16-bit unsigned immedi- 
ate value. The immediate value is generally extended to the left with zeros to form an operand 
of the appropriate width (32 bits on 32-bit implementations, and 64 bits on 64-bit implemen- 
tations). The logical operation is then performed on RA and the zero-extended immediate 
operand and the result is placed into RT. 


The register addressing mode logical instructions all have the following format: 
Operation RT, RA, RB 


Where RT is the target register and is loaded with the result of the operation which is per- 
formed on the contents of register RA and the contents of register RB. 
The single operand instructions all have the following format:
Operation RT, RA 


RT is the target register, loaded with the results of performing the operation on the contents of 
register RA. The descriptions below are split up by logical operation. 


Boolean Operations 


The Boolean operation instructions perform the operations of Boolean algebra in a bitwise
manner on the integer registers. This means that, for instance, bit 0 of the target register
is loaded with the result of the Boolean operation performed on bit 0 of the source operands,
bit 1 of the target register is loaded with the result of the Boolean operation performed on
bit 1 of the source operands, and so on for each bit of the registers. Note that these operations
correspond to C operators like & and | rather than operators like && and ||.


Integer AND Instructions 


The integer AND instructions perform a bitwise AND of the two operands. There are two
immediate form AND instructions, AND immediate (andi.) and AND immediate shifted
(andis.). Both of these instructions use the contents of RA for their first operand. The andi.
instruction uses the immediate field left-extended with zeros for the second operand. The andis.
instruction uses the immediate field right-extended with 16 zeros. For 32-bit implementations,
the 16 bits of the immediate value concatenated with the 16 zeros form the full 32-bit operand;
for 64-bit implementations, they form the low-order 32 bits of the operand, and the upper-order
32 bits of the operand are set to zero. Both of these instructions always update condition
register field zero with a signed comparison of the result to zero; there are no forms of these
instructions that do not update condition register field 0.
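How the second operand is formed from the immediate field can be sketched in Python for a 32-bit implementation. The function names simply echo the mnemonics, and the condition register update is left out of this illustration.

```python
def andi_dot(ra, ui):
    """AND with the immediate field left-extended with zeros."""
    operand = ui & 0xFFFF
    return ra & operand

def andis_dot(ra, ui):
    """AND with the immediate field right-extended with 16 zeros."""
    operand = (ui & 0xFFFF) << 16
    return ra & operand

ra = 0x12345678
print(hex(andi_dot(ra, 0xFF00)))    # masks within the low halfword
print(hex(andis_dot(ra, 0xFF00)))   # masks within the high halfword
```

Because the immediate is only 16 bits wide, a mask that touches both halves of a register takes either both instructions in sequence or a mask held in a register.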


There are two indexed form AND instructions, AND (and[.]) and AND with complement
(andc[.]). These instructions both use the contents of RA and RB as the two operands; however,
the andc[.] instruction inverts the contents of RB before ANDing it with RA (see
Table 3.7).


Integer OR Instructions 


The integer OR instructions perform a bitwise OR of the two operands. There are two imme- 
diate form OR instructions, OR immediate (ori), and OR immediate shifted (oris). Both of 
these instructions use the contents of RA for their first operand. The second operand is formed 
from the immediate field as in the andi. and andis. instructions above. 


There are two register addressing mode OR instructions, OR (or[.]) and OR with complement
(orc[.]). These instructions both use the contents of RA and RB as the two operands;
however, the orc[.] instruction inverts the contents of RB before ORing it with RA (see
Table 3.7).
Integer XOR Instructions


The integer XOR instructions perform a bitwise XOR of the two operands. There are two 
immediate form XOR instructions, XOR immediate (xori), and XOR immediate shifted (xoris). 
Both of these instructions use the contents of RA for their first operand. The second operand 
is formed from the immediate field as in the andi. and andis. instructions above. 


There is one indexed form XOR instruction, XOR (xor[.]). This instruction uses the contents
of RA and RB as the two operands (see Table 3.7).


EXAMPLE 3.5. 





Integer NAND Instruction 


The integer NAND instruction performs a bitwise NAND of the two operands. The only in- 
teger NAND instruction is a register addressing mode instruction and uses the contents of RA 
and RB as the two operands (see Table 3.7). 


Integer NOR Instruction 


The integer NOR instruction performs a bitwise NOR of the two operands. The only integer 
NOR instruction is a register addressing form instruction and uses the contents of RA and RB 
as the two operands (see Table 3.7). 


Integer Equivalent (XNOR) Instruction 


The integer equivalent instruction performs a bitwise XNOR of the two operands. The only
integer equivalent instruction is a register addressing mode instruction and uses the contents
of RA and RB as the two operands (see Table 3.7).


Integer NOT Instruction 


The integer NOT (not[.]) instruction inverts each bit of the operand (RA) and places the
result into register RT. This is an extended mnemonic for the nor[.] instruction (see Table 3.7).
Miscellaneous Logical Instructions


These two logical operation instructions are extended mnemonics based on some of the Bool- 
ean Operation instructions. 


Nop Instruction 


The no operation (nop) instruction is an extended mnemonic which generates the preferred 
form of an instruction which does nothing (see Table 3.7). 


Integer Register Move Instruction 


The integer register move (mr[.]) instruction has one operand which is copied into the target 
register. This is an extended mnemonic for the or[.] instruction (see Table 3.7). 


Integer Sign Extend Instructions 


There are two sign extend operations for 32-bit implementations of the PowerPC architecture,
and one additional sign extend instruction for 64-bit implementations. The sign extend
instructions all have one operand (the contents of RA). The sign extend byte (extsb[.]) instruction
uses the most significant bit of the least significant byte of the contents of RA as a sign bit
and copies it into all but the least significant byte of RT. The least significant byte of RT is
loaded with the least significant byte of RA. Thus the least significant byte of RA is sign
extended and placed into RT. The sign extend halfword (extsh[.]) instruction sign extends the
least significant halfword of RA and places it into RT (see Table 3.7). The last sign extend
instruction is only available on 64-bit implementations. The sign extend word (extsw[.])
instruction places the sign extended least significant word of register RA into RT (see Table 3.8).
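As a rough illustration, the byte and halfword sign extensions on a 32-bit implementation can be modeled in Python as follows. The function names echo the mnemonics; the record form and the 64-bit extsw instruction are not modeled.

```python
def extsb(ra):
    """Sign extend the least significant byte of RA to 32 bits."""
    byte = ra & 0xFF
    # Copy bit 7 of the byte into the upper 24 bits of the result.
    return byte | 0xFFFFFF00 if byte & 0x80 else byte

def extsh(ra):
    """Sign extend the least significant halfword of RA to 32 bits."""
    half = ra & 0xFFFF
    # Copy bit 15 of the halfword into the upper 16 bits of the result.
    return half | 0xFFFF0000 if half & 0x8000 else half

print(hex(extsb(0x12345680)))   # byte 0x80 is negative
print(hex(extsb(0x1234567F)))   # byte 0x7F is positive
print(hex(extsh(0x00018000)))   # halfword 0x8000 is negative
```

Note that the upper bits of RA play no part in the result; only the byte or halfword being extended matters.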


Integer Count Leading Zero Instructions 


The integer count leading zero instructions place the number of leading zeros in the single 
operand into the target register. Both 32-bit and 64-bit implementations implement the count 
leading zeros word (cntlzw[.]) instruction. This instruction counts the number of leading ze- 
ros in the least significant word of the contents of register RA. This count is placed into RT 
(see Table 3.7). The count leading zeros doubleword (cntlzd[.]) instruction is only available 
on 64-bit implementations and counts the number of leading zeros in the doubleword con- 
tained in register RA. Again, the count is placed into register RT (see Table 3.8). 
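The counting loop behind the word form can be written out in Python. This sketch scans from the most significant bit downward and stops at the first 1 bit; the function name echoes the mnemonic, and the record form is not modeled.

```python
def cntlzw(ra):
    """Count the leading zeros in a 32-bit word."""
    ra &= 0xFFFFFFFF
    n = 0
    while n < 32:
        if ra & (0x80000000 >> n):   # test bit n, counting from the most significant bit
            break
        n += 1
    return n

for value in (0x80000000, 0x00010000, 0x00000001, 0x00000000):
    print(hex(value), cntlzw(value))
```

An operand of zero yields 32, which makes the instruction convenient for normalizing numbers and for finding the highest set bit.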


Table 3.7. 32-bit logical operation instructions.

Instruction                      Definition                               Format
and immediate (a)                RT <- (RA) & (16 zeros || UI)            andi. RT,RA,UI
and immediate shifted (b)        RT <- (RA) & (UI || 16 zeros)            andis. RT,RA,UI
and (c)                          RT <- (RA) & (RB)                        and[.] RT,RA,RB
and with complement (c)          RT <- (RA) & ~(RB)                       andc[.] RT,RA,RB
or immediate (a)                 RT <- (RA) | (16 zeros || UI)            ori RT,RA,UI
or immediate shifted (b)         RT <- (RA) | (UI || 16 zeros)            oris RT,RA,UI
or (c)                           RT <- (RA) | (RB)                        or[.] RT,RA,RB
or with complement (c)           RT <- (RA) | ~(RB)                       orc[.] RT,RA,RB
xor immediate (a)                RT <- (RA) ⊕ (16 zeros || UI)            xori RT,RA,UI
xor immediate shifted (b)        RT <- (RA) ⊕ (UI || 16 zeros)            xoris RT,RA,UI
xor (c)                          RT <- (RA) ⊕ (RB)                        xor[.] RT,RA,RB
nand (c)                         RT <- ~((RA) & (RB))                     nand[.] RT,RA,RB
nor (c)                          RT <- ~((RA) | (RB))                     nor[.] RT,RA,RB
equivalent (xnor) (c)            RT <- ~((RA) ⊕ (RB))                     eqv[.] RT,RA,RB
not (c)                          RT <- ~(RA)                              not[.] RT,RA
                                                                          (nor[.] RT,RA,RA)
nop                              (nothing)                                nop
                                                                          (ori r0,r0,0)
move integer register (c)        RT <- (RA)                               mr[.] RT,RA
                                                                          (or[.] RT,RA,RA)
sign extend byte (c)             RT <- (24 copies of RA bit 24)           extsb[.] RT,RA
                                       || RA bits 24..31
sign extend halfword (c)         RT <- (16 copies of RA bit 16)           extsh[.] RT,RA
                                       || RA bits 16..31
count leading zeros word (c)(d)  n = 0;                                   cntlzw[.] RT,RA
                                 while (n < 32)
                                   if (RA bit n == 1) break;
                                   else n++;
                                 RT <- n

Note: If [o] is added, XER_OV and XER_SO will be set; if [.] is added, CR0 will be altered based on the
result of the instruction.

(a) For 64-bit implementations, the second operand is formed as follows: 48 zeros || UI.

(b) For 64-bit implementations, the second operand is formed as follows: 32 zeros || UI || 16 zeros.

(c) The [.] in the mnemonic indicates that this instruction can be coded with a dot if condition register
field 0 should be updated, or without a dot if condition register field 0 should not be updated.

(d) Note that for 64-bit implementations the low-order word is used; thus n starts at 32 instead of 0,
and the while statement should read "while (n<64)."
Table 3.8. 64-bit logical operation instructions.

Instruction                        Definition                               Format
sign extend byte (a)               RT <- (56 copies of RA bit 56)           extsb[.] RT,RA
                                         || RA bits 56..63
sign extend halfword (a)           RT <- (48 copies of RA bit 48)           extsh[.] RT,RA
                                         || RA bits 48..63
sign extend word (a)               RT <- (32 copies of RA bit 32)           extsw[.] RT,RA
                                         || RA bits 32..63
count leading zeros doubleword     n = 0;                                   cntlzd[.] RT,RA
                                   while (n < 64)
                                     if (RA bit n == 1) break;
                                     else n++;
                                   RT <- n

Note: If [o] is added, XER_OV and XER_SO will be set; if [.] is added, CR0 will be altered based on the
result of the instruction.

(a) The [.] in the mnemonic indicates that this instruction can be coded with a dot if condition register
field 0 should be updated, or without a dot if condition register field 0 should not be updated.


Rotate and Shift Instructions 


The PowerPC architecture supports several rotate and shift instructions. These instructions 
use integer registers for their source operands and produce a result that then is stored back into 
an integer register. 


All rotate and shift instructions may update condition register field zero if the record form of 
the instruction is used (specified with a trailing dot (.) in the mnemonic). In these cases, the 
condition register field is set with the results of a signed comparison of the result with zero. 
The overflow bits in the XER are never changed by rotate or shift instructions. 


Shift Instructions 


Instructions are provided for shifting a register right or left. Some of the shift instructions are
defined only for 64-bit PowerPC implementations. These instructions have two operands and
one target. The first operand always is contained in a general integer register and is the value to
be shifted. The second operand is the shift amount, which comes from a general integer register
or from an immediate field in the instruction. There are signed and unsigned shift instructions.
With unsigned instructions, the bits that are vacated by shifting left or right are filled
with zeros. Signed shift instructions are defined only for right shifts, and the bits that
are vacated by shifting are replaced with copies of the most significant bit of the operand (the
sign bit). Thus, with signed shifts, signed numbers keep the same sign. The general form for
shift instructions follows:


shift[.] RT, RA, RB|sh


Unsigned Shift Instructions 


There are eight unsigned shift instructions, four of which are extended mnemonics of rotate 
instructions (rotate instructions are described later). Four of the unsigned shift instructions 
operate on words and are defined for all PowerPC implementations; the other four instruc- 
tions operate on doublewords and are defined only on 64-bit PowerPC implementations. 


The shift left word (slw[.]) and shift right word (srw[.]) instructions shift the contents of a
general integer register (RA) by an amount specified in another general integer register (RB). The
results are placed in a third integer register (RT). In order to derive the shift amount, the
contents of RB are taken modulo 64. If the shift amount is between 32 and 63, then RT is loaded
with 0. Zeros are filled into the vacated bits, and bits that are shifted out of the operand are
lost. On 64-bit machines, bits 0-31 of RT are set to 0. Table 3.9 lists the 32-bit shift instructions.


These instructions can be used for shifting data contained in multiple registers (data larger than 
32 bits, for example). (See Example 3.6.) 
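The modulo-64 rule is what makes multiple-register shifts work without branches: a shift amount of 32 to 63 quietly produces 0. The Python sketch below models slw and srw and uses them to shift a 64-bit value, held as two 32-bit halves, left by 0 to 63 bits. The helper shift_left_64 and the particular instruction sequence it mimics are assumptions made for this illustration, not a listing taken from the text.

```python
MASK32 = 0xFFFFFFFF

def slw(ra, rb):
    """Shift left word: shift amount is RB modulo 64; 32..63 gives 0."""
    n = rb & 0x3F
    return 0 if n >= 32 else (ra << n) & MASK32

def srw(ra, rb):
    """Shift right word: shift amount is RB modulo 64; 32..63 gives 0."""
    n = rb & 0x3F
    return 0 if n >= 32 else (ra & MASK32) >> n

def shift_left_64(hi, lo, n):
    """Shift the 64-bit value hi:lo left by n bits (0 <= n <= 63)."""
    t1 = slw(hi, n)                    # high word shifted left
    t2 = srw(lo, (32 - n) & MASK32)    # bits crossing over when n < 32
    t3 = slw(lo, (n - 32) & MASK32)    # bits crossing over when n >= 32
    new_hi = t1 | t2 | t3              # at most one crossover term is nonzero
    new_lo = slw(lo, n)
    return new_hi, new_lo

hi, lo = shift_left_64(0x01234567, 0x89ABCDEF, 8)
print(hex(hi), hex(lo))
hi, lo = shift_left_64(0x01234567, 0x89ABCDEF, 40)
print(hex(hi), hex(lo))
```

When n is exactly 32, both crossover terms equal the original low word, so ORing them together is still correct.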


EXAMPLE 3.6. 
The shift left word immediate (slwi[.]) and shift right word immediate (srwi[.]) instructions are
extended mnemonics of the rotate left word immediate then AND with mask (rlwinm[.])
instruction. The contents of general integer register RA are shifted by the amount specified in
the 5-bit immediate field sh. Zeros are filled into the vacated bits, and bits that are shifted out
of the operand are lost (see Table 3.9).


The shift left doubleword (sld[.]) and shift right doubleword (srd[.]) instructions shift the
contents of a general integer register (RA) by an amount specified in another general integer
register (RB). The results are placed in a third integer register (RT). In order to derive the shift
amount, the contents of RB are taken modulo 128. If the shift amount is between 64 and 127,
then RT is loaded with 0. Zeros are filled into the vacated bits, and bits that are shifted out of
the operand are lost. These instructions are defined only on 64-bit implementations of the
PowerPC architecture. Table 3.10 lists the 64-bit shift instructions.


The shift left doubleword immediate (sldi[.]) and shift right doubleword immediate (srdi[.])
instructions are extended mnemonics of the rotate left doubleword immediate then clear right
(rldicr[.]) instruction and the rotate left doubleword immediate then clear left (rldicl[.])
instruction, respectively. The contents of general integer register RA are shifted by the amount
specified in the 6-bit immediate field sh. Zeros are filled into the vacated bits, and bits that are shifted
out of the operand are lost. These instructions are defined only on 64-bit implementations of
the PowerPC architecture (see Table 3.10).


Signed Shift Instructions 


There are four signed shift instructions; all are shift right instructions. The shift right algebraic
word (sraw[.]) and the shift right algebraic word immediate (srawi[.]) instructions are defined
for all implementations of the PowerPC architecture. The shift operand (contents of RA) is
shifted right by the shift amount. The shift amount is either the contents of RB modulo 64
(sraw[.] instruction) or the 5-bit immediate field sh (srawi[.] instruction). The vacated bit
positions are filled with the sign bit of the shift operand. If the shift operand is negative and
any 1 bits are shifted out of it, then the carry bit in the XER is set to 1; otherwise, it is set to 0
(see Table 3.9). These instructions can be used to perform fast division by powers of 2 (see
Example 3.7).
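
The role of the carry bit in division can be sketched in C. This is a minimal model of srawi, not the book's own example: the struct and function names are illustrative, and the final step (adding the carry back in, which is what an add-with-carry instruction such as addze would do) is an assumption about the usual idiom rather than something stated in the text above.

```c
#include <stdint.h>

/* Result of a modeled srawi: the shifted value and the XER carry bit. */
typedef struct {
    int32_t rt;
    int     ca;
} sraw_result;

/* Reinterpret a 32-bit pattern as a signed value without relying on
   implementation-defined conversions. */
int32_t to_signed(uint32_t u) {
    return u >= 0x80000000u ? -(int32_t)(~u) - 1 : (int32_t)u;
}

/* Model of srawi for sh in 0..31. */
sraw_result srawi(int32_t ra, unsigned sh) {
    uint32_t u      = (uint32_t)ra;
    uint32_t sign   = u >> 31;
    uint32_t result = u >> sh;
    uint32_t lost   = sh ? (u & ((1u << sh) - 1u)) : 0u;  /* bits shifted out */
    sraw_result r;

    if (sign && sh)
        result |= ~(0xFFFFFFFFu >> sh);   /* fill vacated bits with the sign */
    r.rt = to_signed(result);
    r.ca = (sign && lost != 0u);          /* negative operand lost a 1 bit   */
    return r;
}

/* Signed division by 2^sh, rounding toward zero as C's / operator does. */
int32_t div_pow2(int32_t x, unsigned sh) {
    sraw_result r = srawi(x, sh);
    return r.rt + r.ca;                   /* the add-carry correction step   */
}
```

The shift alone rounds toward negative infinity (-7 shifted right by 1 gives -4); adding the carry corrects negative, inexact results so that the quotient matches truncating division (-3).
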

Table 3.9. 32-bit shift instructions.

Instruction                             Definition                   Format
shift left word                         RT <- (RA) << ((RB)%64)      slw[.] RT, RA, RB
shift right word                        RT <- (RA) >> ((RB)%64)      srw[.] RT, RA, RB
shift left word immediate               RT <- (RA) << sh             slwi[.] RT, RA, sh
shift right word immediate              RT <- (RA) >> sh             srwi[.] RT, RA, sh
                                                                     rlwinm[.] RT, RA
shift right algebraic word immediate    RT <- (RA) >> sh*            srawi[.] RT, RA, sh
shift right algebraic word              RT <- (RA) >> ((RB)%64)*     sraw[.] RT, RA, RB

Note: If [o] is added, XER(OV) and XER(SO) will be set; if [.] is added, CR0 will be altered based on the
result of the instruction.

*XER(CA) is set if any 1s are shifted out of the operand.


EXAMPLE 3.7. 

The shift right algebraic doubleword (srad[.]) and the shift right algebraic doubleword immediate
(sradi[.]) instructions are defined only on 64-bit implementations of the PowerPC
architecture. These instructions behave exactly like the 32-bit signed shift instructions described
earlier, except that they operate on 64-bit operands and that a shift amount from RB is taken
modulo 128 (see Table 3.10).


Table 3.10. 64-bit shift instructions.

Instruction                                   Definition                    Format
shift left doubleword                         RT <- (RA) << ((RB)%128)      sld[.] RT, RA, RB
shift right doubleword                        RT <- (RA) >> ((RB)%128)      srd[.] RT, RA, RB
shift left doubleword immediate               RT <- (RA) << sh              sldi[.] RT, RA, sh
                                                                            rldicr[.] RT, RA, sh, 63-sh
shift right doubleword immediate              RT <- (RA) >> sh              srdi[.] RT, RA, sh
                                                                            rldicl[.] RT, RA, 64-sh, sh
shift right algebraic doubleword immediate    RT <- (RA) >> sh*             sradi[.] RT, RA, sh
shift right algebraic doubleword              RT <- (RA) >> ((RB)%128)*     srad[.] RT, RA, RB

Note: If [o] is added, XER(OV) and XER(SO) will be set; if [.] is added, CR0 will be altered based on the
result of the instruction.

*XER(CA) is set if any 1s are shifted out of the operand.

Rotate Instructions


The rotate instructions are more elaborate than the shift instructions. All the rotate instruc- 
tions perform left rotation and include a masking operation. The only special purpose register 
affected by these instructions is the condition register. The rotate instructions can be used to 
perform a host of common operations and, as such, there are many extended mnemonics. 


The rotate instructions are split into 32-bit and 64-bit rotates. On 64-bit implementations of 
the PowerPC architecture, the 32-bit rotate instructions behave as if the least significant 32 
bits of the operand first are copied into the most significant 32 bits of the operand. This new 
64-bit value then is rotated. 


Two types of masking are used by these rotate instructions. The first type, AND with mask, 
ANDs the mask with the rotated result. The second type, mask and insert, inserts the rotate 
result into the target register under control of the mask. In this case, if there is a 1 in the cor- 
responding mask bit, then the bit of the rotated result is placed into the result register (at the 
same bit position); otherwise, the target register bit is left unchanged. 


There are three 32-bit rotate instructions and 12 extended mnemonics for these instructions. 
The form of all the 32-bit rotate instructions follows: 


rotate RT, RA, sh|RB, MB, ME


The operand to be rotated is contained in RA, and the amount by which to rotate left is
contained in the immediate field sh or the general integer register RB. A mask is generated from
the values contained in the immediate fields MB and ME. MB and ME are 5-bit fields, where
MB is the position of the first 1 bit of the mask and ME is the position of the last 1 bit of the
mask. The bits in the mask between MB and ME (inclusive) are set to 1, while all other bits
are set to 0. If MB is greater than ME, then the mask wraps around: the bits between MB and
bit 31 are set to 1 and the bits between bit 0 and ME are set to 1; again, all other bits are set
to 0. The 32-bit mask generated by the MB and ME fields then is used to control how the
rotated result is placed into the target register RT.


The rotate left word immediate then AND with mask (rlwinm[.]) instruction ANDs the rotated 
result with the generated mask before placing it into RT. There are many uses for this instruc- 
tion, and several extended mnemonics exist to make coding these uses easier. 


The first use is extracting an n-bit field from a 32-bit register and justifying it. The extract and
left justify word immediate (extlwi[.]) extended mnemonic performs this function with left
justification. The form of this mnemonic follows, along with the equivalent rlwinm[.]
instruction:


extlwi[.] RT,RA,n,b = rlwinm[.] RT,RA,b,0,n-1 (n > 0)


In this form, RT is the target register, RA is the source register, n is the size in bits of the field
to be justified, and b is the starting bit of that field. The extract and right justify word
immediate (extrwi[.]) extended mnemonic performs right justification of a bit field.
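
The mask rule and the two extract mnemonics can be modeled in C. This is a sketch under the assumption that bit 0 is the most significant bit (the numbering used in the text); the helper names rotl32 and mask32 are illustrative, and the extrwi equivalence is the one listed in Table 3.11.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n is taken modulo 32). */
uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31u;
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* Build the mask selected by MB and ME (each 0..31, bit 0 = MSB).
   When MB > ME the mask wraps around through bit 31 and bit 0. */
uint32_t mask32(unsigned mb, unsigned me) {
    uint32_t from_mb = 0xFFFFFFFFu >> mb;           /* 1s in bits mb..31 */
    uint32_t to_me   = 0xFFFFFFFFu << (31u - me);   /* 1s in bits 0..me  */
    return mb <= me ? (from_mb & to_me) : (from_mb | to_me);
}

/* rlwinm: rotate left, then AND with the generated mask. */
uint32_t rlwinm(uint32_t ra, unsigned sh, unsigned mb, unsigned me) {
    return rotl32(ra, sh) & mask32(mb, me);
}

/* extlwi RT,RA,n,b: n-bit field starting at bit b, left justified (n > 0). */
uint32_t extlwi(uint32_t ra, unsigned n, unsigned b) {
    return rlwinm(ra, b, 0u, n - 1u);
}

/* extrwi RT,RA,n,b: the same field, right justified (n > 0). */
uint32_t extrwi(uint32_t ra, unsigned n, unsigned b) {
    return rlwinm(ra, b + n, 32u - n, 31u);
}
```

For example, with RA = 0x12345678 the 8-bit field starting at bit 8 is 0x34; extlwi places it in the top byte and extrwi in the bottom byte.
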

The next use for the rlwinm instruction is for simple rotate immediate instructions. The rotate
left word immediate (rotlwi[.]) and rotate right word immediate (rotrwi[.]) instructions also are
extended mnemonics for the rlwinm instruction.


Next, the rlwinm instruction can be used to clear bits in a register. The clear left word
immediate (clrlwi[.]) instruction and the clear right word immediate (clrrwi[.]) instruction are
extended mnemonics that perform this function. The clear left immediate instruction copies the
source register to the target register and clears (sets to 0) the n leftmost bits. The clear right
immediate instruction copies the source register to the target register and clears (sets to 0) the
n rightmost bits.


The final extended mnemonic of the rlwinm instruction is the clear left and shift left word
immediate (clrlslwi[.]) instruction, which combines the function of the clear left immediate and
the shift left immediate instructions into a single operation:


clrlslwi[.] RT,RA,b,n = rlwinm[.] RT,RA,n,b-n,31-n (n ≤ b ≤ 31)


This instruction first clears the b leftmost bits of RA (bits 0 through b-1) and then shifts the
result left by n bits, storing the result in RT.
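
The equivalence between the two-step description and the single rlwinm form can be checked with a small C model. This is a sketch only: rotl32 and mask32 are illustrative helpers that assume bit 0 is the most significant bit, and clear_then_shift is a hypothetical name for the step-by-step version.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n is taken modulo 32). */
uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31u;
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* Mask with 1s from bit mb through bit me (bit 0 = MSB), wrapping if mb > me. */
uint32_t mask32(unsigned mb, unsigned me) {
    uint32_t from_mb = 0xFFFFFFFFu >> mb;
    uint32_t to_me   = 0xFFFFFFFFu << (31u - me);
    return mb <= me ? (from_mb & to_me) : (from_mb | to_me);
}

/* clrlslwi RT,RA,b,n expressed through its rlwinm form (n <= b <= 31). */
uint32_t clrlslwi(uint32_t ra, unsigned b, unsigned n) {
    return rotl32(ra, n) & mask32(b - n, 31u - n);
}

/* The same operation written as two separate steps. */
uint32_t clear_then_shift(uint32_t ra, unsigned b, unsigned n) {
    uint32_t cleared = ra & (0xFFFFFFFFu >> b);   /* clear the b leftmost bits */
    return cleared << n;                          /* then shift left by n      */
}
```

Because n does not exceed b, every bit that the rotate would wrap around has already been cleared, which is why a rotate can stand in for a shift here.
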


The rotate left word then AND with mask (rlwnm[.]) instruction is very similar to the rlwinm
instruction; the only difference is where the rotate amount comes from. In the rlwnm
instruction, the rotate amount comes from the 5 least significant bits of the general integer
register RB. The only extended mnemonic for the rlwnm instruction is the rotate left word
(rotlw) instruction, which is a simple left rotate based on the value in RB (the least significant
5 bits, or RB modulo 32).


The rotate left word immediate then mask insert (rlwimi[.]) instruction again is similar to the
rlwinm instruction. The difference here is in how the mask is used. With the rlwimi
instruction, the mask is used to control the insertion of the rotated source register into the
target register. Wherever there is a 1 bit in the mask, the corresponding bit in the target register
is set to the value of the corresponding bit in the rotated source register. Wherever there is a 0
bit in the mask, the corresponding bit in the target is unchanged.


The insert from left word immediate (inslwi[.]) instruction and the insert from right word
immediate (insrwi[.]) instruction are extended mnemonics of the rlwimi[.] instruction. The
form of the inslwi instruction follows:


inslwi[.] RT,RA,n,b = rlwimi[.] RT,RA,32-b,b,(b+n)-1 (n > 0)

The n leftmost bits of RA are inserted into RT starting at bit b. Thus,

inslwi R1,R2,3,5


inserts the 3 leftmost bits of R2 into R1, starting at bit 5. Bit 5 of R1 is set to the value of bit
0 of R2, bit 6 of R1 is set to the value of bit 1 of R2, and so on. The insrwi instruction has the
same form as the inslwi instruction, but the n rightmost bits are inserted into the target register
starting at bit b (the most significant bit of the rightmost n bits of the source is inserted into bit
b of the target register, and so on). Table 3.11 lists the 32-bit rotate instructions.
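
Mask insertion can be modeled in C as follows. This is a minimal sketch, assuming bit 0 is the most significant bit; rotl32 and mask32 are illustrative helpers, and the function names simply mirror the mnemonics.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n is taken modulo 32). */
uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31u;
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* Mask with 1s from bit mb through bit me (bit 0 = MSB), wrapping if mb > me. */
uint32_t mask32(unsigned mb, unsigned me) {
    uint32_t from_mb = 0xFFFFFFFFu >> mb;
    uint32_t to_me   = 0xFFFFFFFFu << (31u - me);
    return mb <= me ? (from_mb & to_me) : (from_mb | to_me);
}

/* rlwimi: where the mask is 1, take the rotated source bit;
   where it is 0, keep the existing target bit. */
uint32_t rlwimi(uint32_t rt, uint32_t ra, unsigned sh, unsigned mb, unsigned me) {
    uint32_t m = mask32(mb, me);
    return (rotl32(ra, sh) & m) | (rt & ~m);
}

/* inslwi RT,RA,n,b: insert the n leftmost bits of RA into RT at bit b (n > 0). */
uint32_t inslwi(uint32_t rt, uint32_t ra, unsigned n, unsigned b) {
    return rlwimi(rt, ra, 32u - b, b, b + n - 1u);
}
```

With this model, inslwi(0, 0xE0000000, 3, 5) moves the three leftmost bits of the source down to bits 5 through 7 of an otherwise zero target, giving 0x07000000.
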


Table 3.11. 32-bit rotate instructions.

Instruction                     Definition                                  Format

rotate left word immediate      RT <- RL32*((RA), sh) &                     rlwinm[.] RT, RA, sh, mb, me
then AND with mask                MSK32†(mb, me)

rotate left word immediate      RT <- RL32((RA), n)                         rotlwi[.] RT, RA, n
                                                                            rlwinm[.] RT, RA, n, 0, 31

rotate right word immediate     RT <- RL32((RA), 32-n)                      rotrwi[.] RT, RA, n
                                                                            rlwinm[.] RT, RA, 32-n, 0, 31

extract and left justify word   RT <- RL32((RA), b) & MSK32(0, n-1)         extlwi[.] RT, RA, n, b
immediate                                                                   rlwinm[.] RT, RA, b, 0, n-1

extract and right justify word  RT <- RL32((RA), b+n) &                     extrwi[.] RT, RA, n, b
immediate                         MSK32(32-n, 31)                           rlwinm[.] RT, RA, b+n, 32-n, 31

clear left word immediate       RT <- (RA) & MSK32(n, 31)                   clrlwi[.] RT, RA, n
                                                                            rlwinm[.] RT, RA, 0, n, 31

clear right word immediate      RT <- (RA) & MSK32(0, 31-n)                 clrrwi[.] RT, RA, n
                                                                            rlwinm[.] RT, RA, 0, 0, 31-n

clear left and shift left word  RT <- RL32(((RA) & MSK32(b, 31)), n)        clrlslwi[.] RT, RA, b, n
immediate                                                                   rlwinm[.] RT, RA, n, b-n, 31-n (n ≤ b)

rotate left word then AND with  RT <- RL32((RA), (RB)%32) &                 rlwnm[.] RT, RA, RB, mb, me
mask                              MSK32(mb, me)

rotate left word                RT <- RL32((RA), (RB)%32)                   rotlw[.] RT, RA, RB
                                                                            rlwnm[.] RT, RA, RB, 0, 31

rotate left word immediate      RT <- INS32‡(RL32((RA), sh),                rlwimi[.] RT, RA, sh, mb, me
then mask insert                  (RT), mb, me)

insert from left word           RT <- INS32(RL32((RA), 32-b),               inslwi[.] RT, RA, n, b
immediate                         (RT), b, b+n-1)                           rlwimi[.] RT, RA, 32-b, b, b+n-1

insert from right word          RT <- INS32(RL32((RA), 32-b-n),             insrwi[.] RT, RA, n, b
immediate                         (RT), b, b+n-1)                           rlwimi[.] RT, RA, 32-b-n, b, b+n-1


Note: If [o] is added, XER(OV) and XER(SO) will be set; if [.] is added, CR0 will be altered based on the
result of the instruction.


*RL32(A,B) is a function that rotates A to the left by the number of bits specified by B. Thus,
RL32(0x73333333,2) = 0xCCCCCCCD.


†MSK32(A,B) is a function that generates a 32-bit binary mask with 1s starting at bit position A and
ending at bit position B. Thus, MSK32(0,15) = 0xFFFF0000, and MSK32(24,7) = 0xFF0000FF.


‡INS32(A,B,C,D) replaces bits C-D of B with bits C-D of A. Thus,
INS32(0xCCCCCCCC,0xFFFFFFFF,4,7) = 0xFCFFFFFF.

There also is a set of rotate instructions defined only on 64-bit implementations of the PowerPC
architecture. These instructions perform functions similar to the 32-bit rotates described
earlier, but there are slight differences in the semantics of the instructions. The primary difference
between these instructions and the 32-bit rotate instructions is in how the mask is generated.
Instead of generating a mask by specifying the starting and ending bit, the 64-bit rotate
instructions can clear some of the leftmost bits of a rotated operand or some of the rightmost bits
of the rotated operand. The general form of the 64-bit rotate instructions follows:


rotate RT,RA,sh|RB,mb|me


sh is an immediate rotate amount, while RB%64 is the rotate amount for nonimmediate
rotate instructions. The mb field is used for rotate and clear left instructions (the mask has 0s
from bit 0 to bit mb-1, and 1s from bit mb to bit 63), while the me field is used for rotate and
clear right instructions (the mask has 1s from bit 0 to bit me, and 0s from bit me+1 to bit 63).


The first two 64-bit rotate instructions, rotate left doubleword immediate then clear left (rldicl[.])
and rotate left doubleword immediate then clear right (rldicr[.]), can be used together to obtain
the 64-bit equivalent of the 32-bit rlwinm[.] instruction (see Example 3.8).


EXAMPLE 3.8. 





The third 64-bit rotate instruction is the rotate left doubleword immediate then clear (rldic[.])
instruction. This instruction clears the leftmost bits as in the rldicl[.] instruction, but it also
clears the sh rightmost bits (bit 64-sh to bit 63). In essence, this instruction behaves similarly
to the shift left instructions in that all bits rotated from position 0 around to position 63 are
cleared (bits that are shifted out of the most significant bit are lost, and 0s are shifted into
vacated bit positions).
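
The three immediate 64-bit rotates can be modeled in C. This is a sketch, assuming bit 0 is the most significant bit and that mb does not exceed 63-sh for rldic; rotate_and_mask64 is a hypothetical helper showing one way the first two instructions might be paired, and it is not the listing from Example 3.8.

```c
#include <stdint.h>

/* Rotate a 64-bit value left by n bits (n is taken modulo 64). */
uint64_t rotl64(uint64_t x, unsigned n) {
    n &= 63u;
    return n ? (x << n) | (x >> (64u - n)) : x;
}

/* rldicl: rotate left, then clear bits 0..mb-1 (keep bits mb..63). */
uint64_t rldicl(uint64_t ra, unsigned sh, unsigned mb) {
    return rotl64(ra, sh) & (UINT64_MAX >> mb);
}

/* rldicr: rotate left, then clear bits me+1..63 (keep bits 0..me). */
uint64_t rldicr(uint64_t ra, unsigned sh, unsigned me) {
    return rotl64(ra, sh) & (UINT64_MAX << (63u - me));
}

/* rldic: rotate left, then keep only bits mb..63-sh. */
uint64_t rldic(uint64_t ra, unsigned sh, unsigned mb) {
    return rotl64(ra, sh) & (UINT64_MAX >> mb) & (UINT64_MAX << sh);
}

/* Rotate by sh and keep bits mb..me (mb <= me) using two instructions. */
uint64_t rotate_and_mask64(uint64_t ra, unsigned sh, unsigned mb, unsigned me) {
    uint64_t t = rldicl(ra, sh, mb);   /* rotate and clear on the left */
    return rldicr(t, 0u, me);          /* no rotation; clear the right */
}
```

With mb set to 0, rldic reduces to a plain shift left by sh, which matches the relationship to the shift instructions described above.
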

The rotate left doubleword then clear left (rldcl[.]) and the rotate left doubleword then clear right
(rldcr[.]) instructions are nonimmediate analogs of the rldicl[.] and rldicr[.] instructions.


Finally, the rotate left doubleword immediate then mask insert (rldimi[.]) instruction is similar to
the rldic[.] instruction, except that the mask is used to insert the rotated operand into the
target register (as in the rlwimi[.] instruction). Table 3.12 lists the 64-bit rotate instructions.


The extended mnemonics of 64-bit rotate instructions are analogous to the 32-bit rotate ex- 
tended mnemonics. Therefore, they are not described here (they are described in Appendix B, 
“Detailed Instruction Set Reference”). The equivalent (nonextended) mnemonic is shown below 
the instruction form in Table 3.13. 


Table 3.12. 64-bit rotate instructions.

Instruction                     Definition                                  Format

rotate left doubleword          RT <- RL64*((RA), sh) &                     rldicl[.] RT, RA, sh, mb
immediate then clear left         MSK64†(mb, 63)

rotate left doubleword          RT <- RL64((RA), sh) &                      rldicr[.] RT, RA, sh, me
immediate then clear right        MSK64(0, me)

rotate left doubleword          RT <- RL64((RA), sh) &                      rldic[.] RT, RA, sh, mb
immediate then clear              MSK64(mb, 63-sh)

rotate left doubleword then     RT <- RL64((RA), (RB)%64) &                 rldcl[.] RT, RA, RB, mb
clear left                        MSK64(mb, 63)

rotate left doubleword then     RT <- RL64((RA), (RB)%64) &                 rldcr[.] RT, RA, RB, me
clear right                       MSK64(0, me)

rotate left doubleword          RT <- INS64‡(RL64((RA), sh),                rldimi[.] RT, RA, sh, mb
immediate then mask insert        (RT), mb, 63-sh)


*RL64(A,B) is a function that rotates A to the left by the number of bits specified by B. Thus,
RL64(0x73333333 33333333,2) = 0xCCCCCCCC CCCCCCCD.


†MSK64(A,B) is a function that generates a 64-bit binary mask with 1s starting at bit position A and
ending at bit position B. Thus, MSK64(0,15) = 0xFFFF0000 00000000, and MSK64(56,7) =
0xFF000000 000000FF.


‡INS64(A,B,C,D) replaces bits C-D of B with bits C-D of A. Thus,
INS64(0xCCCCCCCC CCCCCCCC,0xFFFFFFFF FFFFFFFF,4,7) = 0xFCFFFFFF FFFFFFFF.


Note: If [o] is added, XER(OV) and XER(SO) will be set; if [.] is added, CR0 will be altered based on the
result of the instruction.





Table 3.13. 64-bit rotate instructions (extended mnemonics).

Instruction                     Definition                                  Format

rotate left doubleword          RT <- RL64*((RA), n)                        rotldi[.] RT, RA, n
immediate                                                                   rldicl[.] RT, RA, n, 0

rotate right doubleword         RT <- RL64((RA), 64-n)                      rotrdi[.] RT, RA, n
immediate                                                                   rldicl[.] RT, RA, 64-n, 0

extract and left justify        RT <- RL64((RA), b) &                       extldi[.] RT, RA, n, b
doubleword immediate              MSK64†(0, n-1)                            rldicr[.] RT, RA, b, n-1

extract and right justify       RT <- RL64((RA), b+n) &                     extrdi[.] RT, RA, n, b
doubleword immediate              MSK64(64-n, 63)                           rldicl[.] RT, RA, b+n, 64-n

clear left doubleword           RT <- (RA) & MSK64(n, 63)                   clrldi[.] RT, RA, n
immediate                                                                   rldicl[.] RT, RA, 0, n

clear right doubleword          RT <- (RA) & MSK64(0, 63-n)                 clrrdi[.] RT, RA, n
immediate                                                                   rldicr[.] RT, RA, 0, 63-n

clear left and shift left       RT <- RL64(((RA) & MSK64(b, 63)), n)        clrlsldi[.] RT, RA, b, n
doubleword immediate                                                        rldic[.] RT, RA, n, b-n (n ≤ b)

rotate left doubleword          RT <- RL64((RA), (RB)%64)                   rotld[.] RT, RA, RB
                                                                            rldcl[.] RT, RA, RB, 0

insert from right doubleword    RT <- INS64‡(RL64((RA), 64-b-n),            insrdi[.] RT, RA, n, b
immediate                         (RT), b, b+n-1)                           rldimi[.] RT, RA, 64-b-n, b


*RL64(A,B) is a function that rotates A to the left by the number of bits specified by B. Thus,
RL64(0x73333333 33333333,2) = 0xCCCCCCCC CCCCCCCD.


†MSK64(A,B) is a function that generates a 64-bit binary mask with 1s starting at bit position A and
ending at bit position B. Thus, MSK64(0,15) = 0xFFFF0000 00000000, and MSK64(56,7) =
0xFF000000 000000FF.


‡INS64(A,B,C,D) replaces bits C-D of B with bits C-D of A. Thus,
INS64(0xCCCCCCCC CCCCCCCC,0xFFFFFFFF FFFFFFFF,4,7) = 0xFCFFFFFF FFFFFFFF.


Move To/From Special Register Instructions 


The only special purpose register described in this section is the fixed point exception register 
(XER). There are two instructions that can be used to move to and from special purpose reg- 
isters. The move to special purpose register (mtspr) instruction is used to copy the contents of an 
integer register into a special purpose register. The move from special purpose register (mfspr) 
instruction is used to copy the contents of a special purpose register into an integer register. 
The form of these instructions follows: 


move RT, SPR 


The RT field specifies a general integer register that is the source or target of the special pur- 
pose register value (depending on whether a move to or move from instruction is being coded). 
The SPR field specifies one of 34 special registers. Only three of the 34 special registers gener- 
ally are used by application code; the other registers are used primarily by operating systems 
(see Appendix C, “Operating System Design for PowerPC Processors”). The three registers 
that are used by application code are the XER, the link register, and the count register. The 
link and count registers are described with the branch instructions. The SPR code for the XER 
register is 1. There also are extended mnemonics for the XER encoding. The move to XER (mtxer) 
and move from XER (mfxer) instructions are extended mnemonics used to move values to and 
from the XER. Table 3.14 lists the move to and from special purpose register instructions. 

Table 3.14. Move to and from special purpose register instructions.

Instruction                                 Definition       Format
move to special purpose register            SPR <- (RT)      mtspr SPR, RT
move from special purpose register          RT <- (SPR)      mfspr RT, SPR
move to fixed point exception register      XER <- (RT)      mtxer RT
                                                             mtspr 1, RT
move from fixed point exception register    RT <- (XER)      mfxer RT
                                                             mfspr RT, 1


Load and Store Instructions 


Load and store instructions are used to access memory. Because the PowerPC architecture is a 
load/store architecture, these are the only instructions that can manipulate memory. The 
PowerPC architecture supports several data types natively. For the integer processor, byte, 
halfword (16-bit), word (32-bit), and doubleword (64-bit) load and store instructions are pro- 
vided (for both signed and unsigned data). For the floating-point processor, both IEEE single- 
and double-precision load and store instructions are provided. 


Address Generation 


Address generation is performed in the integer processor. The programmer generally sees a flat
32-bit address space (or a 64-bit address space for 64-bit implementations). The operating system
manages a 52-bit virtual address space (or an 80-bit virtual address space for 64-bit
implementations) so that each program gets its own flat 32-bit address space (called an effective
address space). Programs therefore cannot interfere with each other and, in general, the programmer
can ignore any other application that may be running on the system. For a discussion of virtual
memory, see Appendix C. Throughout this section, references are made only to effective
addresses, which are what the programmer sees.


An effective address can be calculated in one of two ways. The first method is to add the con- 
tents of a general integer register (the base register) to a constant offset (contained in a 16-bit 
immediate field in the instruction). The second method, indexed address generation, is to add 
the contents of a general integer register (the base register) to the contents of another general 
integer register (register addressing mode). Most of the load and store instructions give the option 
of storing the calculated effective address in the base register—this form is called the update 
form of the instruction. 

Sometimes it is desirable to use 0 instead of a base register. The architecture supports this: if
register 0 is specified as the base register, the value 0 is used instead of the contents of register 0.
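
The two address calculations, the register 0 rule, and the update form can be sketched in C for a 32-bit implementation. This is a minimal model only: the cpu_state type and the function names are hypothetical, and the sign extension of the 16-bit offset follows the sign_ext(D) notation used in the load tables later in this chapter.

```c
#include <stdint.h>

/* Hypothetical register file for a 32-bit implementation. */
typedef struct {
    uint32_t gpr[32];
} cpu_state;

/* Base value: the contents of RA, or 0 when register 0 is named. */
uint32_t base_value(const cpu_state *cpu, unsigned ra) {
    return ra ? cpu->gpr[ra] : 0u;
}

/* Offset form: base plus the sign-extended 16-bit immediate. */
uint32_t ea_offset(const cpu_state *cpu, unsigned ra, int16_t d) {
    return base_value(cpu, ra) + (uint32_t)(int32_t)d;
}

/* Indexed form: base plus the contents of RB. */
uint32_t ea_indexed(const cpu_state *cpu, unsigned ra, unsigned rb) {
    return base_value(cpu, ra) + cpu->gpr[rb];
}

/* Update form: the effective address is also written back to RA.
   Returns -1 when RA is register 0, which this model treats as invalid. */
int ea_offset_update(cpu_state *cpu, unsigned ra, int16_t d, uint32_t *ea) {
    if (ra == 0)
        return -1;
    *ea = cpu->gpr[ra] + (uint32_t)(int32_t)d;
    cpu->gpr[ra] = *ea;
    return 0;
}
```

The update form is convenient for walking through an array: each access both computes the next address and leaves it in the base register for the following access.
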


Endianness 


Endianness refers to the way in which bytes are stored in memory. Bytes are stored in two ways.


The first method, big endian, stores the most significant byte of a multibyte value in the low
address and the least significant byte in the high address. Therefore, if a 4-byte integer is stored
at address 0x20000000, then the most significant byte is at 0x20000000 and the least
significant byte is at 0x20000003. The 68000 architecture is an example of a big endian
architecture.


The second way to store bytes in memory is little endian. In this case, the most significant byte
of a multibyte value is stored in the high address, while the least significant byte is stored in the
low address. Thus, if a 4-byte integer is stored at address 0x20000000, then the most
significant byte can be read from address 0x20000003, while the least significant byte can be read
from 0x20000000. Intel's x86 architecture is an example of a little endian architecture.


Often, programs are not endian-independent, and this can lead to problems when porting code 
from one platform to another or when data is generated on one platform and read on another. 
In general, endian problems arise whenever data is manipulated in multiple sizes (a 32-bit in- 
teger which also is manipulated as an 8-bit character or string of 8-bit characters, for example). 


Suppose that a program uses the low order 24 bits of a 32-bit integer to store data, and the 
upper order 8 bits of the integer to store a flag identifying what the data is. The program was 
written on a little endian machine that supported byte and word (4-byte) loads. In this pro- 
gram, whenever the flag is needed to identify the data, it is loaded by issuing a load to the ad- 
dress of the integer plus 3 (in order to get the high-order byte). Whenever the integer data and 
flag are wanted, the load address is the address of the integer. When this program is compiled 
on a big endian machine, the loads that are supposed to load the flag will fail. In the big endian 
machine, the address of the integer plus 3 will return the least significant byte instead of the 
most significant byte. In fact, all the byte loads that address a flag in an integer must be changed 
to the address of the integer in order for the program to work (see Figure 3.2). 
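
The flag example can be made concrete in C. This is a sketch with illustrative function names: the two store helpers lay out the same 32-bit value in each byte order so the byte at offset 3 can be compared, and flag_of shows an endian-independent alternative (extracting the flag by shifting the whole word) that is an assumption about how such code might be rewritten, not something the text prescribes.

```c
#include <stdint.h>

/* Lay out a 32-bit value in big endian order: MSB at the low address. */
void store_be32(uint8_t bytes[4], uint32_t v) {
    bytes[0] = (uint8_t)(v >> 24);
    bytes[1] = (uint8_t)(v >> 16);
    bytes[2] = (uint8_t)(v >> 8);
    bytes[3] = (uint8_t)v;
}

/* Lay out a 32-bit value in little endian order: LSB at the low address. */
void store_le32(uint8_t bytes[4], uint32_t v) {
    bytes[0] = (uint8_t)v;
    bytes[1] = (uint8_t)(v >> 8);
    bytes[2] = (uint8_t)(v >> 16);
    bytes[3] = (uint8_t)(v >> 24);
}

/* Endian-independent access: load the whole word, then shift and mask. */
uint8_t flag_of(uint32_t word) {
    return (uint8_t)(word >> 24);      /* upper 8 bits hold the flag */
}

uint32_t data_of(uint32_t word) {
    return word & 0x00FFFFFFu;         /* lower 24 bits hold the data */
}
```

For the word 0xAB123456, the byte at offset 3 is the flag 0xAB in the little endian layout but the data byte 0x56 in the big endian layout, whereas flag_of returns 0xAB regardless of how the word was stored.
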


FIGURE 3.2. 0x20000000 0x20000004 

The PowerPC architecture uses big endian addressing, but it has support for little endian
programs. This support comes in two forms. The first form is the byte reverse load instruction,
which reverses the byte order of a word coming into the processor. The second form is a little
endian addressing mode, which is described in Appendix E, "Portability Notes."
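
The effect of a byte reverse load can be sketched in C by modeling memory as a byte array. The function names are illustrative; the model assumes an ordinary word load treats the byte at the effective address as the most significant byte, as described for big endian addressing above.

```c
#include <stdint.h>

/* Ordinary big endian word load: the byte at ea is the most significant. */
uint32_t load_word(const uint8_t *mem, uint32_t ea) {
    return ((uint32_t)mem[ea]     << 24) |
           ((uint32_t)mem[ea + 1] << 16) |
           ((uint32_t)mem[ea + 2] << 8)  |
            (uint32_t)mem[ea + 3];
}

/* Byte reverse word load: the byte at ea becomes the least significant. */
uint32_t load_word_byte_reversed(const uint8_t *mem, uint32_t ea) {
    return  (uint32_t)mem[ea]            |
           ((uint32_t)mem[ea + 1] << 8)  |
           ((uint32_t)mem[ea + 2] << 16) |
           ((uint32_t)mem[ea + 3] << 24);
}
```

If a little endian program wrote the value 0x11223344 as the bytes 44 33 22 11, the byte reversed load recovers 0x11223344, while the ordinary load would return 0x44332211.
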


Integer Load Instructions 


Four data sizes are supported for the integer processor. Three are available on any PowerPC 
implementation: byte (8 bits), halfword (16 bits), and word (32 bits). The last size, doubleword 
(64 bits), is supported only on 64-bit implementations of the PowerPC architecture. When 
the data is loaded into a general integer register, it always is right justified (it occupies the least 
significant bits). These data sizes can be loaded as signed or unsigned values. Signed values are 
sign extended to the register size (32 bits for 32-bit implementations, 64 bits for 64-bit imple- 
mentations); unsigned values are zero extended to the register size. 


The general form of the integer load instructions follows:
load[u][x] RT, RA, D | RB


The address is calculated by adding RA to the 16-bit immediate offset D or to the 32-bit or
64-bit index contained in register RB. The value at that address is loaded into the integer register
RT. If the u form of the instruction is used, RA is updated with the calculated effective
address. If the x form is used, then RB is used to calculate the effective address; otherwise, the
immediate offset D is used. Note that the offset form of the instruction generally is written as
the following:


load[u] RT, D(RA)


For all loads except the update forms, if the register 0 is specified for RA, then the value 0 is 
used instead of the contents of r0. For the load with update instructions, specifying r0 for RA 
is an invalid form. 


There is one byte load instruction: load byte and zero (lbz[u][x]). It is an unsigned load.
The byte at the calculated address is loaded into the least significant byte of the target register
and the rest of the register is filled with zeros. To get a signed byte, the extend sign byte
(extsb[.]) instruction must be used following the byte load.


There are two halfword load instructions. The first instruction, load halfword and zero (lhz[u][x]),
is an unsigned load. The second instruction, load halfword algebraic (lha[u][x]), is a signed load.
The most significant bit in the halfword is repeated in the upper-order 16 bits of the target
register.
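
The difference between the zero-extending and algebraic loads can be sketched in C for a 32-bit implementation. This is a minimal model with memory as a big endian byte array; the functions are named after the mnemonics for readability but are only illustrative.

```c
#include <stdint.h>

/* lbz: the byte lands in the low 8 bits; the rest of the register is 0. */
uint32_t lbz(const uint8_t *mem, uint32_t ea) {
    return mem[ea];
}

/* extsb: repeat bit 7 of the low byte through the upper 24 bits. */
uint32_t extsb(uint32_t r) {
    uint32_t b = r & 0xFFu;
    return (b & 0x80u) ? (b | 0xFFFFFF00u) : b;
}

/* lhz: big endian halfword in the low 16 bits; upper 16 bits are 0. */
uint32_t lhz(const uint8_t *mem, uint32_t ea) {
    return ((uint32_t)mem[ea] << 8) | mem[ea + 1];
}

/* lha: as lhz, but the halfword's sign bit fills the upper 16 bits. */
uint32_t lha(const uint8_t *mem, uint32_t ea) {
    uint32_t h = lhz(mem, ea);
    return (h & 0x8000u) ? (h | 0xFFFF0000u) : h;
}
```

Loading the bytes FF FE as a halfword gives 0x0000FFFE with lhz (65534 unsigned) and 0xFFFFFFFE with lha (-2 signed); a signed byte needs the two-step lbz followed by extsb.
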


There are two word load instructions, but only one is defined for 32-bit implementations. The 
load word and zero instruction (lwz[u][x]) is defined for all implementations of the PowerPC 
architecture. On 32-bit implementations, the 32-bit word is loaded into the target register (no 
zero extension is needed because the 32-bit word consumes the entire register). Table 3.15 lists 
the 32-bit integer load instructions. 

Table 3.15. 32-bit integer load instructions.

Instruction                        Definition*                                    Format†
load byte and zero                 RT <- 0^24 || MEM(sign_ext(D)+(RA)|0, 1)       lbz[u] RT, D(RA)
load byte and zero indexed         RT <- 0^24 || MEM((RB)+(RA)|0, 1)              lbz[u]x RT, RA, RB
load halfword and zero             RT <- 0^16 || MEM(sign_ext(D)+(RA)|0, 2)       lhz[u] RT, D(RA)
load halfword and zero indexed     RT <- 0^16 || MEM((RB)+(RA)|0, 2)              lhz[u]x RT, RA, RB
load halfword algebraic            RT <- sign_ext(MEM(sign_ext(D)+(RA)|0, 2))     lha[u] RT, D(RA)
load halfword algebraic indexed    RT <- sign_ext(MEM((RB)+(RA)|0, 2))            lha[u]x RT, RA, RB
load word and zero                 RT <- MEM(sign_ext(D)+(RA)|0, 4)               lwz[u] RT, D(RA)
load word and zero indexed         RT <- MEM((RB)+(RA)|0, 4)                      lwz[u]x RT, RA, RB


Note: If [o] is added, XER(OV) and XER(SO) will be set; if [.] is added, CR0 will be altered based on the
result of the instruction.

*MEM(A,B) is a function that points to B sequential bytes in memory starting at address A.

†[u] in the mnemonic means that the instruction can be coded with the u, meaning update form, or
without the u, indicating no update form. In the update form RA is loaded with the effective
address of the load. Note that using 0 for RA with the update form of the instruction is invalid and
results in an exception.


On 64-bit implementations, the lwz[u][x] instruction loads the 32-bit word into the low-order
32 bits of the target register, and then loads 0s into the upper-order 32 bits of the register.
In addition, on 64-bit implementations, there is another load word instruction. The load
word algebraic (lwa[[u]x]) instruction is defined only for 64-bit implementations of the
architecture. The sign bit of the word to be loaded is repeated throughout the upper-order 32 bits
of the target register. There is a semantic difference between the offset value for this
instruction and the previous ones. The offset is multiplied by 4 (shifted left by 2) before being added
to the base register. Note that this applies only to the offset value, not to the index value.
Another difference is that although the u and x forms of this instruction are as described earlier,
the lwau form does not exist.


The final integer load instruction is the load doubleword instruction (ld[u][x]) and is available 
only on 64-bit implementations of the architecture. This instruction loads the doubleword at 
the effective address of the load into the target register. There is no sign or zero extension be- 
cause the doubleword consumes all the bits in the target register. This instruction also multi- 
plies the offset (not index) value by 4 before performing the address calculation. Table 3.16 
lists the 64-bit integer load instructions. 


Table 3.16. 64-bit integer load instructions.

Instruction                    Definition                                             Format†
load word algebraic            RT <- sign_ext(MEM*(sign_ext(DS||0b00)+(RA)|0, 4))     lwa RT, DS(RA)
load word algebraic indexed    RT <- sign_ext(MEM((RB)+(RA)|0, 4))                    lwa[u]x RT, RA, RB
load doubleword                RT <- MEM(sign_ext(DS||0b00)+(RA)|0, 8)                ld[u] RT, DS(RA)
load doubleword indexed        RT <- MEM((RB)+(RA)|0, 8)                              ld[u]x RT, RA, RB


Note: If [o] is added, XER(OV) and XER(SO) will be set; if [.] is added, CR0 will be altered based on the
result of the instruction.

*MEM(A,B) is a function that points to B sequential bytes in memory starting at address A.

†[u] in the mnemonic means that the instruction can be coded with the u, meaning update form, or
without the u, indicating no update form. In the update form RA is loaded with the effective
address of the load. Note that using 0 for RA with the update form of the instruction is invalid and
results in an exception.





Integer Store Instructions 


The same four data sizes that were supported for integer loads are supported for integer stores. 
There is no distinction, however, between signed and unsigned data. The form of store in- 
structions follows: 


store[u][x] RS, RA, D | RB 

The address is calculated by adding RA to the 16-bit immediate offset D or to the 32-bit or
64-bit index contained in register RB. The value to be stored is contained in the integer register
RS. If the u form of the instruction is used, RA is updated with the calculated effective address.
If the x form is used, then RB is used to calculate the effective address; otherwise, the immediate
offset D is used. Note that the offset form of the instruction generally is written as the following:


store[u] RS, D(RA) 


If register zero is specified for RA, then the value 0 is used instead of the contents of RA. The
exception is the store with update instructions, for which setting RA to r0 is an invalid form.
Three integer store instructions are specified for 32-bit implementations: the store byte
instruction (stb[u][x]), the store halfword instruction (sth[u][x]), and the store word instruction
(stw[u][x]). All three are available on any PowerPC implementation. The store byte instruction
stores the least significant byte of RS in the memory location specified by the address calculation.
The store halfword instruction stores the least significant halfword contained in RS at the
address specified by the effective address calculation (the most significant byte of the halfword is
stored at the calculated address, and the least significant byte is stored at the address plus 1).
The store word instruction stores the least significant word in register RS starting at the
calculated address (the last byte of the word is stored at the address plus 3). Table 3.17 lists the
32-bit integer store instructions.
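The byte placement described for the three store sizes can be modeled in a few lines of Python. This is a minimal sketch under stated assumptions: memory is a bytearray, the register file is a plain list, addresses are 32 bits wide, and the helper names are invented for the illustration.

```python
def effective_address(regs, ra_number, offset):
    """(RA)|0 + offset: register number 0 contributes the value 0, not r0's contents."""
    base = 0 if ra_number == 0 else regs[ra_number]
    return (base + offset) & 0xFFFFFFFF


def store(memory, ea, rs_value, size):
    """Store the low `size` bytes of rs_value at ea, most significant byte first."""
    low = rs_value & ((1 << (8 * size)) - 1)
    memory[ea:ea + size] = low.to_bytes(size, "big")


# Hypothetical machine state used only for this illustration.
regs = [0] * 32
regs[0] = 0xDEAD          # ignored when register number 0 is named as RA
regs[4] = 0x20
memory = bytearray(0x40)

store(memory, effective_address(regs, 4, 0), 0x11223344, 1)   # models stb rS, 0(r4)
store(memory, effective_address(regs, 4, 4), 0x11223344, 2)   # models sth rS, 4(r4)
store(memory, effective_address(regs, 4, 8), 0x11223344, 4)   # models stw rS, 8(r4)
```

The same source value produces one, two, or four bytes in memory; in each case the bytes come from the low-order end of the register.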


Table 3.17. 32-bit integer store instructions.

Instruction               Definition(a)                                   Format(b)
store byte                MEM(sign_ext(D)+(RA)|0, 1) <- RS[24:31]         stb[u] RS, D(RA)
store byte indexed        MEM((RB)+(RA)|0, 1) <- RS[24:31]                stb[u]x RS, RA, RB
store halfword            MEM(sign_ext(D)+(RA)|0, 2) <- RS[16:31]         sth[u] RS, D(RA)
store halfword indexed    MEM((RB)+(RA)|0, 2) <- RS[16:31]                sth[u]x RS, RA, RB
store word                MEM(sign_ext(D)+(RA)|0, 4) <- (RS)              stw[u] RS, D(RA)
store word indexed        MEM((RB)+(RA)|0, 4) <- (RS)                     stw[u]x RS, RA, RB

Note: If [o] is added, XER[SO] and XER[OV] will be set; if [.] is added, CR0 will be altered based on
the result of the instruction.

(a) MEM(A,B) is a function that points to B sequential bytes in memory, starting at address A.

(b) [u] in the mnemonic means that the instruction can be coded with the u, meaning update form, or
without the u, indicating no update form. In the update form RA is loaded with the effective
address of the store. Note that using 0 for RA with the update form of the instruction is invalid
and results in an exception.




There is one more integer store instruction that is available only on 64-bit implementations.
The store doubleword instruction (std[u][x]) stores the doubleword contained in register RS
starting at the calculated effective address. This instruction multiplies the offset contained in
the D field of the instruction by 4 before calculating the effective address (only for non-x forms
of the instructions). Table 3.18 lists the 64-bit integer store instructions.


Table 3.18. 64-bit integer store instructions.

Instruction                  Definition(a)                                     Format(b)
store doubleword             MEM(sign_ext(DS||0b00)+(RA)|0, 8) <- (RS)         std[u] RS, DS(RA)
store doubleword indexed     MEM((RB)+(RA)|0, 8) <- (RS)                       std[u]x RS, RA, RB

Note: If [o] is added, XER[SO] and XER[OV] will be set; if [.] is added, CR0 will be altered based on
the result of the instruction.

(a) MEM(A,B) is a function that points to B sequential bytes in memory, starting at address A.

(b) [u] in the mnemonic means that the instruction can be coded with the u, meaning update form, or
without the u, indicating no update form. In the update form RA is loaded with the effective
address of the store. Note that using 0 for RA with the update form of the instruction is invalid
and results in an exception.





Special Integer Storage Operations 


The special integer load/store instructions fall into three categories. The first category is the 
byte reversal instructions, which give rudimentary support for little endian data. The second 
category is load and store multiple instructions, which are used primarily for restoring the state 
of the integer registers from the stack (after a subroutine call). The final category is the move 
assist instructions, which give basic support for string manipulation. 


There are two load byte reversal instructions: the load halfword byte reverse indexed (lhbrx) and
the load word byte reverse indexed (lwbrx) instructions. These instructions are analogous to the
lhzx and lwzx instructions, respectively. The difference is that the bytes are loaded in
reverse order. In other words, the most significant byte in memory is placed in the least
significant byte position of the target register, and the least significant byte in memory is
placed in the most significant byte position of the loaded value (see Figure 3.3).


FIGURE 3.3. Load byte reverse instructions (all instructions address MSB in memory).
[Figure panels: memory image; register loaded with the lhzx instruction; register loaded with the
lhbrx instruction; register loaded with the lwzx instruction; register loaded with the lwbrx
instruction.]


There are two store byte reversal instructions, which are analogous to the load byte reversal 
instructions. The store halfword byte-reverse indexed instruction (sthbrx) stores the low-order 2 
bytes of the source operand (RS) after first reversing the order of the 2 bytes. The store word 
byte-reverse indexed instruction (stwbrx) stores the low-order word of the source operand (RS) 
after first reversing the order of the 4 bytes. 
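The effect of the four byte reversal instructions can be illustrated with a small Python model. It is a sketch only: the function names simply reuse the mnemonics, memory is a hypothetical bytearray, and the effective address is passed in already computed.

```python
def lhbrx(memory, ea):
    """Load a halfword with its 2 bytes swapped; the upper bits of the result are zero."""
    return memory[ea] | (memory[ea + 1] << 8)


def lwbrx(memory, ea):
    """Load a word with its 4 bytes reversed."""
    return int.from_bytes(memory[ea:ea + 4], "little")


def sthbrx(memory, ea, rs):
    """Store the low-order 2 bytes of rs in reversed order."""
    memory[ea:ea + 2] = (rs & 0xFFFF).to_bytes(2, "little")


def stwbrx(memory, ea, rs):
    """Store the low-order 4 bytes of rs in reversed order."""
    memory[ea:ea + 4] = (rs & 0xFFFFFFFF).to_bytes(4, "little")


# Hypothetical memory image: the bytes 12 34 56 78 followed by scratch space.
memory = bytearray([0x12, 0x34, 0x56, 0x78, 0, 0, 0, 0])
half = lhbrx(memory, 0)
word = lwbrx(memory, 0)
stwbrx(memory, 4, word)      # storing the reversed word restores the original byte order
```

A reversed load followed by a reversed store round-trips the data, which is what makes these instructions usable for accessing little endian structures from a big endian program.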


The load and store multiple instructions can manipulate blocks of memory that span many 
words. The load multiple word instruction (lmw) has the following form: 


lmw RT, D(RA) 


The effective address of the load is calculated by adding the contents of RA to the immediate
field D. The low-order 32 bits of registers RT through 31 are loaded with sequential words in
memory, starting at the effective address of the load. The store multiple word instruction (stmw)
is the dual of the lmw instruction. That is, the low-order 32 bits of registers RS through 31 are
stored into sequential words in memory, starting at the effective address of the store (RA + D).
Table 3.19 lists the 32-bit special integer load/store instructions.
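The register range covered by the load and store multiple instructions can be sketched as follows. This is an illustrative Python model with a hypothetical list-based register file and bytearray memory; the effective address is assumed to have been computed already.

```python
def stmw(regs, memory, rs, ea):
    """Store the low-order 32 bits of registers rs through 31 into sequential words."""
    for n in range(rs, 32):
        memory[ea:ea + 4] = (regs[n] & 0xFFFFFFFF).to_bytes(4, "big")
        ea += 4


def lmw(regs, memory, rt, ea):
    """Load registers rt through 31 from sequential words in memory."""
    for n in range(rt, 32):
        regs[n] = int.from_bytes(memory[ea:ea + 4], "big")
        ea += 4


# Save r29 through r31 to a hypothetical stack area, then restore them elsewhere.
regs = list(range(32))        # register n holds the value n
memory = bytearray(64)
stmw(regs, memory, 29, 16)

restored = [0] * 32
lmw(restored, memory, 29, 16)
```

The number of words transferred is 32 minus the starting register number, so naming r29 moves three words.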


Table 3.19. 32-bit special integer load/store instructions.

Instruction                             Definition(a)                                        Format
load halfword byte reverse indexed      RT <- 0x0000 || MEM(1+(RB)+(RA)|0, 1)                lhbrx RT,RA,RB
                                              || MEM((RB)+(RA)|0, 1)
load word byte reverse indexed          RT <- MEM(3+(RB)+(RA)|0, 1)                          lwbrx RT,RA,RB
                                              || MEM(2+(RB)+(RA)|0, 1)
                                              || MEM(1+(RB)+(RA)|0, 1)
                                              || MEM((RB)+(RA)|0, 1)
store halfword byte reverse indexed     MEM((RB)+(RA)|0, 2) <- RS[24:31] || RS[16:23]        sthbrx RS,RA,RB
store word byte reverse indexed         MEM((RB)+(RA)|0, 4) <- RS[24:31] || RS[16:23]        stwbrx RS,RA,RB
                                              || RS[8:15] || RS[0:7]
load multiple word(b)                   RT[0:31](c) ... R31[0:31] <-                         lmw RT,D(RA)
                                              MEM(sign_ext(D)+(RA)|0, 4*(32-'RT(d)))
store multiple word(e)                  MEM(sign_ext(D)+(RA)|0, 4*(32-'RS)) <-               stmw RS,D(RA)
                                              RS[0:31] ... R31[0:31]

(a) MEM(A,B) is a function that points to B sequential bytes in memory, starting at address A.

(b) On 64-bit implementations, the load multiple instruction loads only the lower-order 32 bits of
each target register (RT). The upper bits are set to zero.

(c) RT[0:31] refers to the low-order 32 bits of the register on 64-bit implementations or the entire
register on 32-bit implementations.

(d) 'RT refers to the register number rather than the contents of that register.

(e) On 64-bit implementations the store multiple instruction stores only the lower-order 32 bits of
each store data source register (RS). The upper bits are ignored.
The last four integer load/store instructions are the move assist instructions, which manipulate
strings of bytes that may be longer than eight bytes (these instructions also may update multiple
registers). There are two load string instructions. These instructions are similar to the load
multiple word instructions, in that they update multiple registers, but instead of specifying the
number of registers to be updated, the number of bytes to be loaded is specified. As many
registers as are necessary are loaded (starting at RT, and possibly wrapping around through R0).
Again, only the low-order 32 bits of each target register are loaded. The load string word
immediate instruction (lswi) specifies the number of bytes to load in an immediate field. If 0 bytes
is specified, then 32 bytes are loaded. The load string word indexed instruction (lswx) specifies
the number of bytes to load in bits 25 through 31 of the XER. The store string word immediate
(stswi) and store string word indexed (stswx) instructions are analogous to the lswi and lswx
instructions, respectively. Table 3.20 lists the integer move assist instructions.
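How a load string instruction packs bytes into registers can be sketched in Python. This is a minimal model under assumptions: 32-bit registers held in a list, a bytearray for memory, a precomputed effective address, and hypothetical helper names. It follows the packing described above, with four bytes per register, wraparound from register 31 to register 0, and zeros in the unfilled bytes of the last register.

```python
def load_string(regs, memory, rt, ea, nbytes):
    """Pack nbytes from memory into registers starting at rt, 4 bytes per register."""
    for i in range(nbytes):
        n = (rt + i // 4) % 32            # wrap around from register 31 to register 0
        if i % 4 == 0:
            regs[n] = 0                   # bytes that are never filled remain zero
        regs[n] |= memory[ea + i] << (24 - 8 * (i % 4))


def lswi(regs, memory, rt, ea, nb):
    """Immediate form: an NB value of 0 means 32 bytes."""
    load_string(regs, memory, rt, ea, nb if nb else 32)


# Load the 7-byte string "ABCDEFG" starting at register 31.
regs = [0xFFFFFFFF] * 32
memory = bytearray(b"ABCDEFG")
lswi(regs, memory, 31, 0, 7)
```

Seven bytes need two registers, so the load fills register 31 and then wraps to register 0, whose last byte is cleared; every other register is left alone.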


Table 3.20. Integer move assist instructions.

Instruction                     Definition(a)(b)                                             Format
load string word immediate      for (i=0; i < (NB?NB:32); i++)                               lswi RT,RA,NB
                                {
                                  R[('RT(c)+i/4)%32][(i%4)*8 : (i%4)*8+7]
                                      <- MEM(((RA)?(RA):0)+i, 1)
                                }
                                the unfilled low-order bytes of the last register
                                loaded, if any, are set to 0
load string word indexed        for (i=0; i < XER[25:31]; i++)                               lswx RT,RA,RB
                                {
                                  R[('RT+i/4)%32][(i%4)*8 : (i%4)*8+7]
                                      <- MEM((RB)+((RA)?(RA):0)+i, 1)
                                }
                                the unfilled low-order bytes of the last register
                                loaded, if any, are set to 0
store string word immediate     for (i=0; i < (NB?NB:32); i++)                               stswi RS,RA,NB
                                {
                                  MEM(((RA)?(RA):0)+i, 1)
                                      <- R[('RS+i/4)%32][(i%4)*8 : (i%4)*8+7]
                                }
store string word indexed       for (i=0; i < XER[25:31]; i++)                               stswx RS,RA,RB
                                {
                                  MEM((RB)+((RA)?(RA):0)+i, 1)
                                      <- R[('RS+i/4)%32][(i%4)*8 : (i%4)*8+7]
                                }

(a) MEM(A,B) is a function that points to B sequential bytes in memory, starting at address A.

(b) These instructions only operate on the low-order 32 bits of the registers in a 64-bit
implementation. The equations are given for a 32-bit implementation with the bit numbers ranging
from 0 to 31. For a 64-bit implementation the bit numbers range from 32 to 63 (simply add 32 to the
bit numbers given).

(c) 'RT refers to the register number rather than the contents of that register.
Branch and Control Flow Instructions
The branch and control flow instructions are used to control the execution of a program. These 
instructions make things like subroutine calls and for-next loops possible. The PowerPC ar- 
chitecture supports several different branch instructions, all of which direct program execution 
to one of two possible addresses. The sequential address is simply the address of the branch plus 
4 (the instruction immediately following the branch in memory). The target address is gener- 
ated by the branch and can be any address in the memory space. If the target address is se- 
lected, then the branch is said to be taken; if the sequential address is selected, then the branch 
is said to be not taken. Throughout this chapter, references are made to the program counter. 
This is not an architectural register in the PowerPC architecture; instead, it is a convenient 
way to refer to the address of the instruction being executed. Remember that from the 
programmer's point of view, the instructions execute one at a time, but in reality the hardware 
may be executing many instructions at once. 


There are four types of branches as classified by how the target address is calculated. The first 
type, relative branch, generates the target address by adding an immediate offset to the current 
program counter (the address of the branch instruction in memory). The second type, absolute 
branch, generates the address by using an immediate field in the instruction directly. The third 
type, branch to link or return branch, uses the contents of the link register as a branch target. 
The fourth and final type of branch target, branch to count, uses the contents of the count reg- 
ister as the target address of the branch. Each of these branches has a purpose, or a set of se- 
mantics that programmers should follow. These semantics are described in this chapter. 


It also is possible to classify branches based on how the selection between the target and the 
sequential address is made. The simplest type of branch in this regard is the unconditional branch. 
Unconditional branches always select the target address. The next type, the branch and
decrement branch, subtracts 1 from the value in the count register and then compares the new
value of the count register to 0. The address selection is made based on the result of the
compare. Finally, any bit in the condition register can be examined and the address selection can be
made based on the value of that bit. Note that the condition register and count register tests 
can be combined into a single test. Also note that unlike many other architectures, it is not 
possible to base the branch address selection or the target address of the branch directly on an 
integer register. 


Branch Instruction Descriptions 


There are four branch instructions, three of which have many forms. There also are many 
extended mnemonics, which make coding branch instructions much easier. The four 
instructions are described in detail in this chapter, and Tables 4.2 and 4.4 list the extended 
mnemonics. 


All branch instructions may update the link register with their sequential address. This is coded
with an [l] in the instruction mnemonic. The link register is used for subroutine linkage, so
when a subroutine is being called, the link register should be updated. The branch to link
register instruction then can be used to return from the subroutine (see Example 4.1).

When coding relative and absolute branches, you typically use a label, rather than an actual
address or offset. For instance, in Example 4.1, the first instruction is coded as bl SUBROUTINE
instead of using some exact offset. The assembler calculates and inserts the exact offset.


The unconditional long branch instruction (b[l][a]) is an unconditional branch with a 24-bit
immediate field. The b[l]a form of the instruction uses the immediate field as an absolute
address, and the b[l] form of the instruction uses the immediate field as a signed offset from the
current program counter. Before being used, the immediate field has two binary zeros appended
to the right (the address/offset is multiplied by 4). This gives a word-aligned address, which is
required for instructions in the PowerPC architecture.
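The target address calculation for the unconditional branch can be sketched in Python. This is an illustrative model, not a description of any assembler: it assumes 32-bit addresses, the function names are invented, and sign extension of the field in the absolute form is taken from the architecture definition rather than from the text above.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


def branch_target(li_field, branch_address, absolute):
    """Target of b[l][a]: the 24-bit field with 0b00 appended, used directly or as an offset."""
    offset = sign_extend(li_field, 24) << 2       # multiply by 4 for word alignment
    if absolute:
        return offset & 0xFFFFFFFF                # the ba and bla forms
    return (branch_address + offset) & 0xFFFFFFFF  # the b and bl forms


forward = branch_target(5, 0x1000, absolute=False)          # five instructions ahead
backward = branch_target(0xFFFFFF, 0x1000, absolute=False)  # a field value of -1
fixed = branch_target(0x40, 0x1000, absolute=True)
```

With 24 bits plus the two appended zeros, a relative branch can reach roughly 32 MB in either direction from the branch instruction.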


The branch conditional (bc[l][a]) instruction has many options. The form of the instruction 
follows: 


bc BO, BI, BD 


The BD field is a 14-bit immediate field that has two binary zeros appended to the right, giving a
16-bit value. This value is used as an absolute address (bc[l]a form) or a signed offset from the
current program counter (bc[l] form). The BO field is a 5-bit immediate value that describes how
the address selection of the branch is performed (see Table 4.1).


Table 4.1. BO field encodings for branch conditional, branch conditional to count, and branch
conditional to link instructions.

Encoding(a)    Definition
0b0000y(b)     Decrement the count register, then branch if the new value in the count register
               is not 0, and the condition(c) is false.
0b0001y        Decrement the count register, then branch if the new value in the count register
               is 0, and the condition is false.
0b0010y        Branch if the condition is false.
0b0100y        Decrement the count register, then branch if the new value in the count register
               is not 0, and the condition is true.
0b0101y        Decrement the count register, then branch if the new value in the count register
               is 0, and the condition is true.
0b0110y        Branch if the condition is true.
0b1000y        Decrement the count register, then branch if the new value in the count register
               is not 0.
0b1001y        Decrement the count register, then branch if the new value in the count register
               is 0.
0b10100        Branch always.

(a) These are the only valid BO field encodings.

(b) The y-bit is the bit that reverses the default branch prediction.

(c) The condition is determined by the bit in the condition register specified in the BI field. A
value of 0 is a false condition, and a value of 1 is a true condition.





Finally, the BI field is a 5-bit immediate field that codes which bit in the condition register
should be examined, if the condition register is examined at all.
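The way the BO and BI fields select between the target and sequential addresses can be sketched in Python. The per-bit reading of BO used here is inferred from the pattern of encodings in Table 4.1 and is labeled as such in the comments; the function name, the 32-bit condition register value (bit 0 is the most significant bit), and the 32-bit count register are assumptions of this illustration.

```python
def branch_conditional(bo, bi, cr, ctr):
    """Return (taken, new_ctr) for a conditional branch with the given BO and BI fields."""
    ignore_condition = (bo >> 4) & 1   # inferred: 1 means the CR bit is not examined
    wanted_value = (bo >> 3) & 1       # inferred: CR bit value that allows the branch
    skip_count = (bo >> 2) & 1         # inferred: 1 means CTR is not decremented or tested
    want_zero = (bo >> 1) & 1          # inferred: 1 means branch when CTR becomes 0
    # The least significant bit (y) affects only prediction, never the outcome.

    if not skip_count:
        ctr = (ctr - 1) & 0xFFFFFFFF
    count_ok = bool(skip_count) or ((ctr == 0) == bool(want_zero))

    cr_bit = (cr >> (31 - bi)) & 1
    condition_ok = bool(ignore_condition) or (cr_bit == wanted_value)

    return count_ok and condition_ok, ctr


# Hypothetical CR value with only bit 2 set (the third bit of field zero).
cr = 1 << (31 - 2)
always = branch_conditional(0b10100, 0, cr, 5)
loop_again = branch_conditional(0b10000, 0, cr, 2)
loop_done = branch_conditional(0b10000, 0, cr, 1)
if_true = branch_conditional(0b01100, 2, cr, 7)
if_false = branch_conditional(0b00100, 2, cr, 7)
```

Note that the count register is decremented whenever a decrementing encoding is used, whether or not the branch ends up being taken.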


The branch conditional to link register (bclr[l]) and branch conditional to count register (bcctr[l])
instructions are similar to the branch conditional instruction, except that the link register or 
the count register is used as the target address. The form of these instructions follows: 


branch BO, BI 


The BO and BI fields are as defined earlier for the branch conditional instruction, with the 
exception that the branch and decrement forms are not valid with the branch to count register 
instruction. 


At this point, a word about hardware implementation is in order. When a processor is executing
an instruction stream, it is common for it to have many instructions executing simultaneously.
Often, when a branch instruction is encountered that needs to examine a condition
register bit, that condition register bit has not yet been updated by the instruction that sets it.
When this occurs, the hardware often tries to predict the direction the branch will take so that 
it can continue to fetch and execute instructions while waiting for the condition register to be 
set. If, when the condition register finally is updated, the prediction is determined to be cor- 
rect, then the processor continues to fetch and execute instructions along the control path it 
predicted. If, on the other hand, the branch was predicted incorrectly, then the instructions 
that were being fetched and executed along the predicted control path must be purged (no 
effects of these instructions are visible to the programmer) and the correct instructions must be 
fetched and executed. The accuracy of this prediction can have a significant effect on the ex- 
ecution time of a program. 


The PowerPC architecture enables the programmer to give hardware a hint about which way 
it should predict a conditional branch. Normally, branches with negative displacement will be 
predicted as taken, while those with positive displacement will be predicted as not taken. The 
least significant bit of the BO field enables the programmer to change this prediction scheme. 
If this bit is set to 1, then the default prediction is reversed; if the bit is 0, then the prediction 
is not reversed. Note that hardware is not required to use this prediction scheme, but it is en- 
couraged to use it (all current PowerPC implementations default to this scheme, although some 
have dynamic prediction schemes as well). 
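The static prediction rule just described reduces to a one-line test, sketched here in Python. The function name is hypothetical, and the model covers only the default scheme from the text, not any dynamic predictor an implementation might add.

```python
def predicted_taken(displacement, y_bit):
    """Static prediction: backward branches are predicted taken unless y reverses it."""
    default_taken = displacement < 0
    return default_taken != bool(y_bit)


loop_branch = predicted_taken(-5, 0)       # backward branch, default prediction
loop_exit_hint = predicted_taken(-5, 1)    # backward branch with the prediction reversed
forward_branch = predicted_taken(5, 0)     # forward branch, default prediction
forward_hint = predicted_taken(5, 1)       # forward branch with the prediction reversed
```

The default favors loops, where the closing branch has a negative displacement and is taken on every iteration except the last.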


The branch and decrement options (see Table 4.1) are used for coding loops. These options 
use the count register as the iteration variable and in a single instruction close the loop and 
decrement that iteration variable (see Table 4.2). 
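The loop structure that the branch and decrement options support can be modeled in Python. This sketch is not a reconstruction of the book's example; it is a hypothetical illustration in which a local variable stands in for the count register, and the comments name the PowerPC operations each step is meant to mirror.

```python
def sum_with_count_register(values):
    """Sum a list using a loop shaped like one closed by a decrement-and-branch."""
    total = 0
    index = 0
    ctr = len(values)              # models loading the count register (mtctr)
    if ctr == 0:
        return total               # guard: decrementing a CTR of 0 would wrap around
    while True:
        total += values[index]     # loop body
        index += 1
        ctr -= 1                   # the branch decrements the count register...
        if ctr == 0:               # ...and falls through once the new value is 0
            break
    return total


result = sum_with_count_register([1, 2, 3, 4])
```

The decrement, the compare to 0, and the branch are a single instruction on the processor, so the loop needs no separate compare and no general register for the iteration count.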


EXAMPLE 4.2.
There are many extended mnemonics for the three conditional branch instructions, which make 
coding these branches simpler. Rather than list every one of the extended mnemonics in a single 
list, this section shows how they are constructed and lists them in Table 4.2. 


The first set of extended mnemonics gives a shortcut to coding different BO fields (see Table
4.2). The following shows the form of these instructions:


branch BI, target # for absolute branches 


branch BI, offset # for relative branches 


branch BI # for branches to either the link or count registers 


Table 4.2. Extended mnemonics for the different BO field encodings.

                                                     Target Address Type
                                            (bc)        (bca)        (bclr)        (bcctr)
Branch Semantics (X(a))                     relative    absolute     to link       to count
branch unconditionally                      n/a         n/a          blr[l]        bctr[l]
branch if condition true (t)                bt[l]       bt[l]a       btlr[l]       btctr[l]
branch if condition false (f)               bf[l]       bf[l]a       bflr[l]       bfctr[l]
decrement count and branch if count         bdnz[l]     bdnz[l]a     bdnzlr[l]     n/a
  non-zero (dnz)
decrement count and branch if count         bdz[l]      bdz[l]a      bdzlr[l]      n/a
  zero (dz)
decrement count and branch if count         bdnzt[l]    bdnzt[l]a    bdnztlr[l]    n/a
  non-zero and condition true (dnzt)
decrement count and branch if count         bdnzf[l]    bdnzf[l]a    bdnzflr[l]    n/a
  non-zero and condition false (dnzf)
decrement count and branch if count         bdzt[l]     bdzt[l]a     bdztlr[l]     n/a
  zero and condition true (dzt)
decrement count and branch if count         bdzf[l]     bdzf[l]a     bdzflr[l]     n/a
  zero and condition false (dzf)

(a) The building block used to form the mnemonic is shown in parentheses beside the semantic
description. The mnemonic for the branch is built out of three components: b[direction
mnemonic][target mnemonic].


The second set of extended mnemonics for the branch instructions has to do with coding the
BI field for the branch if condition false and branch if condition true cases. The condition
register is conceptually split into eight 4-bit fields. Each of the 4 bits in a field has a meaning
related to how it is set by the various instructions. When an integer instruction sets a condition
register field (for example, add. automatically updates condition register field zero with a
compare of the result to 0), the 4 bits mean less than, greater than, zero, and summary overflow,
respectively. These mnemonics are constructed by concatenating to the branch mnemonic (b) a
condition register bit code (see Table 4.3), followed by a target code (see Table 4.4). The form of
these instructions follows:


b[CR code][target code][l] [CR field], [target]


The CR field operand specifies which of the eight condition register fields should be exam- 
ined, and the target is specified for relative branches (as a signed offset) or absolute branches 
(as an actual address). If the CR field is left out of the instruction, then condition register field 
zero is examined. 
Table 4.3. Condition register bit codes.

CR Code    Bit    True or False    Meaning
lt         0      true             branch if less than
le         1      false            branch if less than or equal to
eq         2      true             branch if equal to
ge         0      false            branch if greater than or equal to
gt         1      true             branch if greater than
nl         0      false            branch if not less than
ne         2      false            branch if not equal to
ng         1      false            branch if not greater than
so         3      true             branch if summary overflow
ns         3      false            branch if not summary overflow
un         3      true             branch if unordered (see floating point compare instructions)
nu         3      false            branch if not unordered (see floating point compare instructions)


Table 4.4. Extended mnemonics for condition register bit encodings.

                                                      Target address type
                                             (bc)       (bca)       (bclr)       (bcctr)
Branch Semantics (X(a))                      relative   absolute    to link      to count
branch if less than (lt)                     blt[l]     blt[l]a     bltlr[l]     bltctr[l]
branch if less than or equal to (le)         ble[l]     ble[l]a     blelr[l]     blectr[l]
branch if equal to (eq)                      beq[l]     beq[l]a     beqlr[l]     beqctr[l]
branch if greater than or equal to (ge)      bge[l]     bge[l]a     bgelr[l]     bgectr[l]
branch if greater than (gt)                  bgt[l]     bgt[l]a     bgtlr[l]     bgtctr[l]
branch if not less than (nl)                 bnl[l]     bnl[l]a     bnllr[l]     bnlctr[l]
branch if not equal to (ne)                  bne[l]     bne[l]a     bnelr[l]     bnectr[l]
branch if not greater than (ng)              bng[l]     bng[l]a     bnglr[l]     bngctr[l]
branch if summary overflow (so)              bso[l]     bso[l]a     bsolr[l]     bsoctr[l]
branch if not summary overflow (ns)          bns[l]     bns[l]a     bnslr[l]     bnsctr[l]
branch if unordered (see floating point      bun[l]     bun[l]a     bunlr[l]     bunctr[l]
  compare instructions) (un)
branch if not unordered (see floating        bnu[l]     bnu[l]a     bnulr[l]     bnuctr[l]
  point compare instructions) (nu)

(a) The building block that is used to form the mnemonic is shown in parentheses beside the
semantic description. The mnemonic for the branch is built out of three components: b[CR
code][target code][l].





The final extended mnemonic component for the branch instructions gives an easy way to code 
the prediction bit. If the branch should be predicted as taken, then a plus (+) should be ap- 
pended to the mnemonic; if the branch should be predicted as not taken, then a minus (-) 
should be appended to the mnemonic. If the preferred direction is unknown, then neither a 
plus nor a minus should be appended to the mnemonic. Thus, 


blt+ CR2, LABEL 


is taken if bit 0 of condition register field 2 is 1. If the branch is predicted,
then it will be predicted as taken. The target of the branch is the label LABEL. Assuming that
LABEL is five instructions ahead of the branch instruction, it could be coded equivalently as
the following:


bc 0b01101, 8, 5


The BI value of 8 selects bit 0 of condition register field 2 (4 x 2 + 0). If LABEL were five
instructions behind the branch instruction, then it could be coded as the following:


bc 0b01100, 8, -5


Note that in the second case, the offset is negative, so it will default to predicted taken. Thus,
in the second case, the least significant bit of the BO field is set to 0, meaning use the default
prediction.
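The translation from an extended mnemonic such as blt+ to the BO and BI fields can be sketched in Python. This is a hypothetical helper, not any assembler's actual algorithm: the table of codes is copied from Table 4.3, the BO base values come from Table 4.1, and the function name and argument layout are invented for the illustration.

```python
# CR code -> (bit within the 4-bit field, branch when that bit is true?), per Table 4.3.
CR_CODES = {
    "lt": (0, True), "le": (1, False), "eq": (2, True), "ge": (0, False),
    "gt": (1, True), "nl": (0, False), "ne": (2, False), "ng": (1, False),
    "so": (3, True), "ns": (3, False), "un": (3, True), "nu": (3, False),
}


def encode_branch(code, cr_field, displacement, hint=""):
    """Return (BO, BI) for b<code> with an optional '+' or '-' prediction suffix."""
    bit, branch_if_true = CR_CODES[code]
    bi = 4 * cr_field + bit                       # bit number within the whole CR
    bo = 0b01100 if branch_if_true else 0b00100   # branch if condition true / false
    if hint:
        default_taken = displacement < 0          # backward branches default to taken
        if (hint == "+") != default_taken:
            bo |= 1                               # set y to reverse the default
    return bo, bi


ahead = encode_branch("lt", 2, 5, "+")     # blt+ CR2, LABEL with LABEL ahead
behind = encode_branch("lt", 2, -5, "+")   # the same branch with LABEL behind
```

The two calls reproduce the BO and BI values of the two bc encodings given in the text: the y-bit is set only when the requested prediction differs from the default for that displacement.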
Compare Instructions, Examples 


Now that you have seen how to use the condition register to control the direction of branches, 
you need to have some way to get data into the condition register. You saw throughout the 
description of the integer instructions how to set condition register field zero with information 
about the result of an operation. There are other, more general ways, to set bits in the condi- 
tion register. 


There is a set of eight integer compare instructions that can be used to compare two values
and place the results of that compare into any field in the condition register. The form of the
compare instructions follows:


compare BF, RA, RB | SI | UI


BF is a 3-bit field that identifies one of the eight condition register fields to update.
One operand is the contents of general integer register RA. The second operand is one of the
contents of RB, the signed immediate field SI, or the unsigned immediate field UI.


The first four compare instructions are available on any PowerPC implementation. The compare
word immediate instruction (cmpwi) and the compare word instruction (cmpw) perform
signed compares of the two operands. The cmpwi instruction gets one operand from a general
integer register, while the other is the sign extended 16-bit immediate field in the instruction.
The cmpw instruction gets both of its operands from general integer registers. These
instructions set the following 4 bits of the condition register field selected by BF, where the
second operand is SI or RB: bit 0 (LT) is set to 1 if RA is less than the second operand using a
signed comparison; bit 1 (GT) is set to 1 if RA is greater than the second operand using a signed
comparison; bit 2 (EQ) is set to 1 if RA is equal to the second operand; and bit 3 is set to
the value of the summary overflow bit in the XER.
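The 4 bits produced by a word compare, and their placement in the field selected by BF, can be modeled in Python. This is a sketch under assumptions: operands are 32-bit values, the condition register is a 32-bit integer whose bit 0 is the most significant bit, and the function names are hypothetical. The unsigned path anticipates the logical compares.

```python
def to_signed32(value):
    """Reinterpret a 32-bit pattern as a signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def compare(ra, operand, so, signed):
    """Return the 4-bit result LT, GT, EQ, SO (LT is the most significant bit)."""
    a, b = ra & 0xFFFFFFFF, operand & 0xFFFFFFFF
    if signed:
        a, b = to_signed32(a), to_signed32(b)
    return ((a < b) << 3) | ((a > b) << 2) | ((a == b) << 1) | (so & 1)


def set_cr_field(cr, bf, field):
    """Replace condition register field bf (0 through 7) with a 4-bit value."""
    shift = 4 * (7 - bf)
    return (cr & ~(0xF << shift) & 0xFFFFFFFF) | (field << shift)


signed_result = compare(0xFFFFFFFF, 1, 0, signed=True)      # -1 compared with 1
unsigned_result = compare(0xFFFFFFFF, 1, 0, signed=False)   # 4294967295 compared with 1
cr = set_cr_field(0, 2, signed_result)                      # place the result in field 2
```

The same bit patterns compare as less than when treated as signed and greater than when treated as unsigned, which is the whole difference between the two families of compares.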


The compare logical word immediate (cmplwi) and compare logical word (cmplw) instructions 
are similar to the cmpwi and cmpw instructions, except that all comparisons are unsigned, and 
the immediate field (for cmplwi) is not sign extended (see Example 4.3). Table 4.5 lists the 32- 
bit integer compare instructions. 


EXAMPLE 4.3. 
Table 4.5. 32-bit integer compare instructions.

Instruction                       Definition(a)                                           Format
compare word                      CR[BF] <- ((RA) < (RB)) || ((RA) > (RB))                cmpw BF, RA, RB
                                        || ((RA) == (RB)) || XER[SO]
compare word immediate            CR[BF] <- ((RA) < sign_ext(SI))                         cmpwi BF, RA, SI
                                        || ((RA) > sign_ext(SI))
                                        || ((RA) == sign_ext(SI)) || XER[SO]
compare logical word              CR[BF] <- ((RA) u< (RB)) || ((RA) u> (RB))              cmplw BF, RA, RB
                                        || ((RA) == (RB)) || XER[SO]
compare logical word immediate    CR[BF] <- ((RA) u< UI) || ((RA) u> UI)                  cmplwi BF, RA, UI
                                        || ((RA) == UI) || XER[SO]

(a) The relation u< is unsigned less than; the relation u> is unsigned greater than.


Four compare instructions are available only on 64-bit implementations of the PowerPC
architecture. These are analogous to the four word compare instructions, except
that they perform their comparison on a doubleword. They are the compare doubleword (cmpd),
compare doubleword immediate (cmpdi), compare logical doubleword (cmpld), and compare
logical doubleword immediate (cmpldi) instructions. Table 4.6 lists the 64-bit integer compare
instructions.
Table 4.6. 64-bit integer compare instructions.

Instruction                             Definition(a)                                     Format
compare doubleword                      CR[BF] <- ((RA) < (RB)) || ((RA) > (RB))          cmpd BF, RA, RB
                                              || ((RA) == (RB)) || XER[SO]
compare doubleword immediate            CR[BF] <- ((RA) < sign_ext(SI))                   cmpdi BF, RA, SI
                                              || ((RA) > sign_ext(SI))
                                              || ((RA) == sign_ext(SI)) || XER[SO]
compare logical doubleword              CR[BF] <- ((RA) u< (RB)) || ((RA) u> (RB))        cmpld BF, RA, RB
                                              || ((RA) == (RB)) || XER[SO]
compare logical doubleword immediate    CR[BF] <- ((RA) u< UI) || ((RA) u> UI)            cmpldi BF, RA, UI
                                              || ((RA) == UI) || XER[SO]

(a) The relation u< is unsigned less than; the relation u> is unsigned greater than.


Move To/From Special Branch Register Instructions


There are extended mnemonics of the move to and move from special purpose register in- 
structions for the link register and the count register (see “Move To/From Special Register 
Instructions” in Chapter 3). The move to link register (mtlr) and move from link register (mflr) 
instructions are extended mnemonics that are used to move values to and from the link register 
(LR). The move to count register (mtctr) and move from count register (mfctr) instructions are 
extended mnemonics that are used to move values to and from the count register (CTR). See 
Example 4.4. 


EXAMPLE 4.4.
Three move instructions also are used to manipulate the condition register. The move to condi- 
tion register field instruction (mtcrf) moves the contents of a general integer register into the 
condition register under the control of an 8-bit mask contained in an immediate field. The 
move to condition register from XER instruction (mcrxr) copies the high-order 4 bits of the XER 
(the summary overflow bit, the overflow bit, the carry bit, and a reserved bit) into a field in the 
condition register. Finally, the move from condition register instruction (mfcr) copies the con- 
tents of the condition register into a general integer register. Table 4.7 lists the move to and 
from branch special purpose register instructions. 
Table 4.7. Move to and from branch special purpose register instructions.

Instruction                           Definition                                               Format
move to link register                 LR <- (RT)                                               mtlr RT
                                                                                               mtspr 8,RT
move from link register               RT <- (LR)                                               mflr RT
move to count register                CTR <- (RT)                                              mtctr RT
                                                                                               mtspr 9,RT
move from count register              RT <- (CTR)                                              mfctr RT
move to condition register fields     CR <-                                                    mtcrf FXM,RS
                                        ((FXM0 & RS[0:3])   | (¬FXM0 & CR[0:3]))   ||
                                        ((FXM1 & RS[4:7])   | (¬FXM1 & CR[4:7]))   ||
                                        ((FXM2 & RS[8:11])  | (¬FXM2 & CR[8:11]))  ||
                                        ((FXM3 & RS[12:15]) | (¬FXM3 & CR[12:15])) ||
                                        ((FXM4 & RS[16:19]) | (¬FXM4 & CR[16:19])) ||
                                        ((FXM5 & RS[20:23]) | (¬FXM5 & CR[20:23])) ||
                                        ((FXM6 & RS[24:27]) | (¬FXM6 & CR[24:27])) ||
                                        ((FXM7 & RS[28:31]) | (¬FXM7 & CR[28:31]))
move to condition register from XER   CR[BF] <- XER[0:3]                                       mcrxr BF
move from condition register          RT <- (CR)                                               mfcr RT





Note that in the move to condition register fields section of the table, for 64-bit
implementations bits 32 through 63 of RS are moved into the CR. The appropriate bit numbers
can be obtained by adding 32 to the numbers given.
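

The masked update performed by mtcrf can be sketched in a higher-level language. The following
Python model is only an illustration, not part of the architecture definition: the function name
and the choice to hold the CR and RS as 32-bit integers are assumptions made for this sketch. It
follows the PowerPC convention that bit 0 is the most significant bit, so bit 0 of FXM controls
condition register field 0.

```python
def mtcrf(cr: int, rs: int, fxm: int) -> int:
    """Illustrative model of mtcrf FXM,RS on a 32-bit condition register.

    Bit 0 of FXM (its most significant bit) selects CR field 0, which is
    the leftmost 4 bits of the CR.
    """
    result = cr & 0xFFFFFFFF
    for field in range(8):
        if fxm & (0x80 >> field):           # is FXM bit `field` set?
            shift = 28 - 4 * field           # position of this 4-bit field
            field_mask = 0xF << shift
            # Take the field from RS; unselected fields keep their CR value.
            result = (result & ~field_mask) | (rs & field_mask)
    return result & 0xFFFFFFFF


# Replace only fields 0 and 7; the six middle fields keep their old value.
print(hex(mtcrf(0x11111111, 0xABCDEF98, 0b10000001)))
```

A mask of 0xFF replaces the whole condition register, and a mask of 0 leaves it unchanged.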


CR Logical Instructions 


The PowerPC architecture supplies a set of instructions for directly manipulating the
condition register bits. These are known as the condition register (CR) logical instructions;
they enable the programmer to perform Boolean operations on bits in the condition register.
These instructions are useful for coding multiple condition branches using a single branch.
The form of the condition register logical instructions follows:


cr-logical BT, BA, BB


where BT, BA, and BB are 5-bit fields, each specifying a bit of the condition register. The BT
bit is loaded with the result of the operation specified by the instruction and performed on the
bits specified by the BA and BB fields. There are four extended mnemonics for the condition
register logical instructions. These instructions are available on all PowerPC implementations
(see Table 4.8).
Table 4.8. Condition register logical instructions.


Instruction                    Definition                        Format
condition register AND         CR[BT] <- CR[BA] & CR[BB]         crand BT, BA, BB
condition register OR          CR[BT] <- CR[BA] | CR[BB]         cror BT, BA, BB
condition register XOR         CR[BT] <- CR[BA] ⊕ CR[BB]         crxor BT, BA, BB
condition register NAND        CR[BT] <- ¬(CR[BA] & CR[BB])      crnand BT, BA, BB
condition register NOR         CR[BT] <- ¬(CR[BA] | CR[BB])      crnor BT, BA, BB
condition register EQV         CR[BT] <- CR[BA] ≡ CR[BB]         creqv BT, BA, BB
condition register AND         CR[BT] <- CR[BA] & ¬CR[BB]        crandc BT, BA, BB
with complement
condition register OR          CR[BT] <- CR[BA] | ¬CR[BB]        crorc BT, BA, BB
with complement
condition register set         CR[BT] <- 0b1                     crset BT
                                                                 creqv BT, BT, BT
condition register clear       CR[BT] <- 0b0                     crclr BT
                                                                 crxor BT, BT, BT
condition register move        CR[BT] <- CR[BA]                  crmove BT, BA
                                                                 cror BT, BA, BA
condition register not         CR[BT] <- ¬CR[BA]                 crnot BT, BA
                                                                 crnor BT, BA, BA
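

To see how a CR logical instruction lets one branch test two conditions, the following Python
sketch models a few of these instructions on a 32-bit integer that stands in for the condition
register. The helper names are hypothetical, and the sketch assumes that bit 2 of each 4-bit
field is the "equal" bit, as in the compare instruction definitions. Bit 0 is the most
significant bit.

```python
def cr_bit(cr: int, n: int) -> int:
    """Return bit n of a 32-bit CR value (bit 0 is the most significant)."""
    return (cr >> (31 - n)) & 1


def cr_logical(cr: int, op, bt: int, ba: int, bb: int) -> int:
    """Apply a two-input Boolean op to bits BA and BB, storing it in bit BT."""
    result = op(cr_bit(cr, ba), cr_bit(cr, bb)) & 1
    mask = 1 << (31 - bt)
    return (cr & ~mask & 0xFFFFFFFF) | (result << (31 - bt))


def crand(cr, bt, ba, bb):
    return cr_logical(cr, lambda a, b: a & b, bt, ba, bb)


def cror(cr, bt, ba, bb):
    return cr_logical(cr, lambda a, b: a | b, bt, ba, bb)


def crnor(cr, bt, ba, bb):
    return cr_logical(cr, lambda a, b: ~(a | b), bt, ba, bb)


def crnot(cr, bt, ba):
    # Extended mnemonic: crnot BT,BA is crnor BT,BA,BA.
    return crnor(cr, bt, ba, ba)


EQ0, EQ1 = 2, 6   # assumed "equal" bits of CR fields 0 and 1

# Two earlier compares both found equality; fold them into one bit so that
# a single conditional branch on bit EQ0 tests both conditions at once.
cr = (1 << (31 - EQ0)) | (1 << (31 - EQ1))
cr = crand(cr, EQ0, EQ0, EQ1)
print(cr_bit(cr, EQ0))
```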





The move condition register field instruction (mcrf) copies the contents of one 4-bit condition
register field into another 4-bit condition register field. The form of this instruction follows:


mcrf BF, BFA 


The BF and BFA fields are 3 bits wide, and each field specifies one of the eight 4-bit fields in
the condition register. The field specified by BFA is copied into the field specified by BF.
Table 4.9 shows the move condition register field instruction.


Table 4.9. Move condition register field instruction.


Instruction                    Definition                Format
move condition register        CR[BF] <- CR[BFA]         mcrf BF, BFA
field
Trap/System Call Instructions


The trap and system call instructions enable the programmer to pass control of a program to the 
operating system. Control may be passed at two points in the system. The system call instruc- 
tion passes control to the system call interrupt vector; the trap instructions pass control to the 
system trap handler. 


The system call instruction (sc) is used to tell the operating system to perform some service.
The services available are operating-system dependent. This instruction is similar in function
to the INT instruction in Intel’s x86 architecture.


There are four trap instructions. Two of the instructions are available on any PowerPC imple- 
mentation, and the other two are available only on 64-bit PowerPC implementations. The form 
of the trap instructions follows: 


trap TO,RA,SI/RB 


The trap instructions perform a function similar to the compare instruction, comparing RA to 
either SI sign extended to 32 bits (for immediate form trap instructions) or RB (for other trap 
instructions). The results then are masked by the 5-bit TO field, and action is taken on the 
masked result. The TO field mask bits have the same meaning on all trap instructions. Table 
4.10 lists the trap instruction mask bits for the TO field. 


Table 4.10. Trap instruction mask bits (TO field).


Mask Bit    Definition
0           Less than, using signed comparison
1           Greater than, using signed comparison
2           Equal
3           Less than, using unsigned comparison
4           Greater than, using unsigned comparison


If, for any 0b1 in the TO field, the corresponding condition is met, then the trap is taken. If
the trap is taken, then control of the program is passed to the system trap handler.
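

The way the TO mask selects among the five comparison results can be sketched as follows. This
Python model is an illustration only; the function names are hypothetical, and the operands are
passed as plain integers rather than read from registers. TO bit 0 is the most significant of
the five mask bits.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def trap_taken(to: int, ra: int, rb: int, bits: int = 32) -> bool:
    """Return True if a trap with mask TO would be taken for operands ra, rb."""
    mask = (1 << bits) - 1
    ua, ub = ra & mask, rb & mask                    # unsigned views
    sa, sb = to_signed(ua, bits), to_signed(ub, bits)  # signed views
    conditions = [
        sa < sb,    # TO bit 0: less than, signed
        sa > sb,    # TO bit 1: greater than, signed
        ua == ub,   # TO bit 2: equal
        ua < ub,    # TO bit 3: less than, unsigned
        ua > ub,    # TO bit 4: greater than, unsigned
    ]
    return any(bool(to & (0x10 >> bit)) and met
               for bit, met in enumerate(conditions))


# 0xFFFFFFFF is -1 when signed but the largest value when unsigned.
print(trap_taken(0b10000, 0xFFFFFFFF, 0))   # signed less than
print(trap_taken(0b00010, 0xFFFFFFFF, 0))   # unsigned less than
```

Passing bits=64 models the doubleword comparisons in the same way.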


The trap word (tw) and trap word immediate (twi) instructions are available on any PowerPC 
implementation and perform the comparison on a 32-bit word (the low-order 32 bits of the 
registers are used on 64-bit implementations). Table 4.11 lists the 32-bit trap instructions. 
Table 4.11. 32-bit trap instructions.


Instruction            Definition                                       Format
trap word              if ((TO[0] & ((RA) < (RB))) |                    tw TO, RA, RB
                           (TO[1] & ((RA) > (RB))) |
                           (TO[2] & ((RA) == (RB))) |
                           (TO[3] & ((RA) <u (RB))) |
                           (TO[4] & ((RA) >u (RB))))
                           trap
                       else
                           don’t trap
trap word              if ((TO[0] & ((RA) < sign_ext(SI))) |            twi TO, RA, SI
immediate                  (TO[1] & ((RA) > sign_ext(SI))) |
                           (TO[2] & ((RA) == sign_ext(SI))) |
                           (TO[3] & ((RA) <u sign_ext(SI))) |
                           (TO[4] & ((RA) >u sign_ext(SI))))
                           trap
                       else
                           don’t trap


The trap doubleword (td) and trap doubleword immediate (tdi) instructions are available only
on 64-bit PowerPC implementations and perform their comparison on 64-bit doublewords
(the SI field is sign extended to 64 bits for the tdi instruction). Table 4.12 lists the 64-bit
trap instructions.


Table 4.12. 64-bit trap instructions.


Instruction            Definition                                       Format
trap doubleword        if ((TO[0] & ((RA) < (RB))) |                    td TO, RA, RB
                           (TO[1] & ((RA) > (RB))) |
                           (TO[2] & ((RA) == (RB))) |
                           (TO[3] & ((RA) <u (RB))) |
                           (TO[4] & ((RA) >u (RB))))
                           trap
                       else
                           don’t trap
trap doubleword        if ((TO[0] & ((RA) < sign_ext(SI))) |            tdi TO, RA, SI
immediate                  (TO[1] & ((RA) > sign_ext(SI))) |
                           (TO[2] & ((RA) == sign_ext(SI))) |
                           (TO[3] & ((RA) <u sign_ext(SI))) |
                           (TO[4] & ((RA) >u sign_ext(SI))))
                           trap
                       else
                           don’t trap


There are several extended mnemonics for the trap instructions. These are similar to the
extended mnemonics for the conditional branch instructions. These extended mnemonics are
provided for the most commonly used TO field encodings. Table 4.13 lists the TO field
encoding for extended mnemonics, and Table 4.14 lists the extended mnemonics for trap
instructions.


Table 4.13. TO field encoding for extended mnemonics.


Code      Definition                             Decimal TO    <    >    =    <u   >u
lt        less than                              16            1    0    0    0    0
le        less than or equal to                  20            1    0    1    0    0
eq        equal to                               4             0    0    1    0    0
ge        greater than or equal to               12            0    1    1    0    0
gt        greater than                           8             0    1    0    0    0
nl        not less than                          12            0    1    1    0    0
ne        not equal to                           24            1    1    0    0    0
ng        not greater than                       20            1    0    1    0    0
llt       logically less than                    2             0    0    0    1    0
lle       logically less than or equal to        6             0    0    1    1    0
lge       logically greater than or equal to     5             0    0    1    0    1
lgt       logically greater than                 1             0    0    0    0    1
lnl       logically not less than                5             0    0    1    0    1
lng       logically not greater than             6             0    0    1    1    0
<none>    unconditional                          31            1    1    1    1    1


Table 4.14. Extended mnemonics for trap instructions.


                                           32-Bit Trap           64-Bit Trap
                                           Instructions          Instructions
Instruction Semantics                      tw        twi         td        tdi
trap unconditionally                       trap      -           -         -
trap if less than                          twlt      twlti       tdlt      tdlti
trap if less than or equal to              twle      twlei       tdle      tdlei
trap if equal                              tweq      tweqi       tdeq      tdeqi
trap if greater than or equal to           twge      twgei       tdge      tdgei
trap if greater than                       twgt      twgti       tdgt      tdgti
trap if not less than                      twnl      twnli       tdnl      tdnli
trap if not equal to                       twne      twnei       tdne      tdnei
trap if not greater than                   twng      twngi       tdng      tdngi
trap if logically less than                twllt     twllti      tdllt     tdllti
trap if logically less than or equal to    twlle     twllei      tdlle     tdllei
trap if logically greater than or          twlge     twlgei      tdlge     tdlgei
equal to
trap if logically greater than             twlgt     twlgti      tdlgt     tdlgti
trap if logically not less than            twlnl     twlnli      tdlnl     tdlnli
trap if logically not greater than         twlng     twlngi      tdlng     tdlngi






Floating-Point Instructions 
The PowerPC architecture includes floating-point instructions and registers that are
implemented in the floating-point unit integrated in PowerPC chips. Like the integer instruction
set, the floating-point instruction set follows a load/store architecture: the only instructions
that access system memory are load and store instructions, which load data into general
floating-point registers and store data from general floating-point registers. There is very
little difference between the floating-point architectures of the 32-bit PowerPC architecture
and the 64-bit PowerPC architecture. Unless otherwise noted, all instructions are available on
all implementations of the PowerPC architecture.


The floating-point unit supports both single and double precision operations, but the
floating-point registers support only double precision format. When operations are performed,
the inputs are taken as double precision values and an infinitely precise result is formed. This
result then is rounded to the target precision (single precision for floating-point single
precision operations, and double precision for floating-point double precision operations) under
control of the rounding mode (see Chapter 2, “Introduction to the PowerPC Architecture”).


Floating-Point Load and Store Instructions 


There are separate load and store instructions for the floating-point unit. The address for the 
floating-point load and store instructions, however, is generated in the integer unit just like it 
was for the integer load instructions. Each of these instructions supports an update mode that 
is the same as the update mode form of the integer load instructions. All these instructions also 
support both immediate and indexed addressing modes, just like in the integer load and store 
instructions. 


There are two floating-point load instructions. The load floating-point single (lfs[u][x])
instruction loads a 32-bit single precision floating-point number from memory into a general
floating-point register. The load floating-point double (lfd[u][x]) instruction loads a 64-bit
double precision floating-point number from memory into a general floating-point register. When
an lfs instruction is used, the data being loaded is extended to a double precision value before
it is placed in a register. Single precision positive and negative infinity and Not a Number
(NaN) values are translated to double precision infinities and NaNs. Zero is left zero, and
normalized numbers are zero extended to the larger mantissa (zeros added to the low-order bit
positions). Denormalized numbers are normalized as double precision numbers (see Chapter 2).
Table 5.1 lists the floating-point load instructions.


Table 5.1. Floating-point load instructions.


Instruction                    Definition                                     Format
load floating-point single     FRT <- DOUBLE(                                 lfs[u] FRT, D(RA)
                               MEM(sign_ext(D)+(RA)|0, 4))
load floating-point single     FRT <- DOUBLE(                                 lfs[u]x FRT, RA, RB
indexed                        MEM((RB)+(RA)|0, 4))
load floating-point double     FRT <- MEM(sign_ext(D)+(RA)|0, 8)              lfd[u] FRT, D(RA)
load floating-point double     FRT <- MEM((RB)+(RA)|0, 8)                     lfd[u]x FRT, RA, RB
indexed


a. MEM(A,B) is a function that points to B sequential bytes in memory, starting at address A. 


b. The [u] in the mnemonic indicates that the instruction can be coded with the u, meaning update
form, or without the u, meaning no update form. In the update form, RA is loaded with the
effective address of the load. Note that using 0 for RA with the update form of the instruction is
invalid and results in an exception.
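

The widening that lfs performs can be observed in any language whose floating-point type is a
double. The following Python sketch is only an illustration: the function names mirror the
mnemonics but are hypothetical, memory is modeled as a bytes object, and big-endian byte order
is assumed. It shows that extending a normalized single to a double is exact and simply leaves
the 29 extra low-order fraction bits zero.

```python
import struct


def lfs(memory: bytes, address: int) -> float:
    """Model of lfs: read 4 bytes as a single and widen it to a double."""
    (value,) = struct.unpack_from(">f", memory, address)
    return value   # Python floats are doubles, so the value is now widened


def lfd(memory: bytes, address: int) -> float:
    """Model of lfd: read 8 bytes as a double, unchanged."""
    (value,) = struct.unpack_from(">d", memory, address)
    return value


# Store 0.1 once as a single (4 bytes) and once as a double (8 bytes).
memory = struct.pack(">f", 0.1) + struct.pack(">d", 0.1)
single = lfs(memory, 0)
double = lfd(memory, 4)

# The widened single keeps its 23 fraction bits; the remaining 29 are zero.
bits = struct.unpack(">Q", struct.pack(">d", single))[0]
print(bits & ((1 << 29) - 1))
# The two loads differ because 0.1 was rounded to different precisions.
print(single == double)
```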





There are two floating-point store instructions. The store floating-point single instruction
(stfs[u][x]) stores the contents of a general floating-point register as a 32-bit single
precision floating-point number into memory. The store floating-point double instruction
(stfd[u][x]) stores the contents of a general floating-point register as a 64-bit double
precision floating-point number into memory. The store floating-point single instruction rounds
the double precision value held in the source floating-point register to a single precision
number before storing it. This transformation also may involve converting a double precision
normalized number into single precision denormalized number format. Table 5.2 lists the
floating-point store instructions.


Table 5.2. Floating-point store instructions.


Instruction                    Definition (a)                                 Format (b)
store floating-point single    MEM(sign_ext(D)+(RA)?(RA):0, 4)                stfs[u] FRS, D(RA)
                               <- round_single((FRS))
store floating-point single    MEM((RB)+(RA)?(RA):0, 4)                       stfs[u]x FRS, RA, RB
indexed                        <- round_single((FRS))
store floating-point double    MEM(sign_ext(D)+(RA)?(RA):0, 8)                stfd[u] FRS, D(RA)
                               <- (FRS)
store floating-point double    MEM((RB)+(RA)?(RA):0, 8)                       stfd[u]x FRS, RA, RB
indexed                        <- (FRS)


a. MEM(A,B) is a function that points to B sequential bytes in memory, starting at address A. 


b. The [u] in the mnemonic indicates that the instruction can be coded with the u, meaning update
form, or without the u, meaning no update form. In the update form, RA is loaded with the
effective address of the store. Note that using 0 for RA with the update form of the instruction
is invalid and results in an exception.
Like many of the integer instructions, many of the floating-point instructions may update the
condition register implicitly, in addition to updating the target floating-point register with
the result of the operation. Floating-point instructions that update the condition register
update field 1 (instead of field 0, which the integer instructions updated). Rather than
comparing the result of the operation to 0, as the integer instructions do for field 0, these
instructions possibly update the first 4 bits of the FPSCR (FX, FEX, VX, OX) and then copy
those bits into condition register field 1.


In order to have an instruction implicitly update condition register field 1, the instruction 
mnemonic has a dot [.] appended to it. This form is not available for all floating-point instruc- 
tions. 
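

The copy into condition register field 1 can be sketched as a simple bit-field move. In this
Python illustration the function name is hypothetical, and both registers are modeled as 32-bit
integers whose bit 0 is the most significant bit, so the four high-order FPSCR bits land in CR
bits 4 through 7.

```python
def update_cr1(cr: int, fpscr: int) -> int:
    """Copy the four high-order FPSCR bits into CR field 1 (CR bits 4..7)."""
    top4 = (fpscr >> 28) & 0xF            # FPSCR bits 0..3
    field1_mask = 0xF << 24               # CR bits 4..7
    return (cr & ~field1_mask & 0xFFFFFFFF) | (top4 << 24)


# Only field 1 changes; the other seven fields keep their old contents.
print(hex(update_cr1(0xFFFFFFFF, 0x90000000)))
```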


Arithmetic Instructions, Examples 


The floating-point arithmetic instructions enable the programmer to manipulate floating-point
values that have been loaded into the floating-point general registers. The arithmetic
operations supported include add, subtract, multiply, and divide. The following sections
describe a few other arithmetic operations that are supported.


Floating-Point Move Instructions 


Four instructions move the contents of one floating-point register into another floating-point 
register, possibly performing some unary operation as they do so. The general form of these 
instructions follows: 


move FRT, FRB 


These instructions can be coded to update the condition register implicitly, and they can be 
applied to single or double precision data. 


The floating-point move register instruction (fmr[.]) simply copies the contents of FRB into
FRT. The floating-point negate instruction (fneg[.]) copies the additive inverse of the contents
of FRB into FRT. Because the mantissa is represented in signed magnitude notation, negation
means inverting the sign bit of the mantissa. The floating-point absolute value instruction
(fabs[.]) stores the absolute value of FRB into FRT. This involves setting the sign bit of the
mantissa to 0 as it is stored into FRT. Finally, the floating-point negative absolute value
instruction (fnabs[.]) finds the absolute value of the contents of FRB, and then negates it
before storing it into FRT. Table 5.3 lists the floating-point move instructions.


Table 5.3. Floating-point move instructions.


Instruction                       Definition                Format
floating-point register move      FRT <- (FRB)              fmr[.] FRT, FRB
floating-point negate             FRT <- 0 - (FRB)          fneg[.] FRT, FRB
floating-point absolute value     FRT <- |(FRB)|            fabs[.] FRT, FRB
floating-point negative           FRT <- 0 - |(FRB)|        fnabs[.] FRT, FRB
absolute value


a. The [.] in the mnemonic indicates that the instruction can be coded with a dot if condition
register field 1 should be updated, or without a dot if condition register field 1 should not be
updated.
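

Because the move instructions only touch the sign bit, they can be modeled with plain bit
operations on the 64-bit image of a double. The Python sketch below is an illustration; the
helper names are hypothetical (fabs_ carries a trailing underscore only to avoid confusion with
the library function of the same name).

```python
import struct

SIGN_BIT = 1 << 63   # the sign bit of a 64-bit double precision image


def _bits(x: float) -> int:
    """Return the 64-bit image of a double."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def _value(bits: int) -> float:
    """Return the double whose 64-bit image is `bits`."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def fmr(frb: float) -> float:
    return frb                                # plain copy


def fneg(frb: float) -> float:
    return _value(_bits(frb) ^ SIGN_BIT)      # invert the sign bit


def fabs_(frb: float) -> float:
    return _value(_bits(frb) & ~SIGN_BIT)     # clear the sign bit


def fnabs(frb: float) -> float:
    return _value(_bits(frb) | SIGN_BIT)      # set the sign bit


print(fneg(1.5), fabs_(-2.0), fnabs(2.0), fneg(0.0))
```

Note that negating zero by flipping the sign bit yields negative zero, which a subtraction from
0 would not.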





Floating-Point Add/Subtract Instructions 


There are two floating-point add and two floating-point subtract instructions. The general form 
of these instructions follows: 


add FRT, FRA, FRB 


where the operation specified by the instruction is performed on the contents of floating-point 
registers FRA and FRB, and the result of the operation is stored into floating-point register 
FRT. These instructions optionally can update condition register field one, as described ear- 
lier. 


The floating-point add single instruction (fadds[.]) adds two single precision numbers; the
floating-point add double instruction (fadd[.]) adds two double precision numbers. The main
difference between these two instructions is the precision to which the result is rounded. Some
implementations may have higher performance with the single precision operation.


The floating-point subtract single instruction (fsubs[.]) subtracts two single precision numbers; 
the floating-point subtract double instruction (fsub[.]) subtracts two double precision numbers. 
The main difference between these two instructions is the precision to which the result is 
rounded. Table 5.4 lists the floating-point add and subtract instructions. 


Table 5.4. Floating-point add and subtract instructions.


Instruction                       Definition                   Format
floating-point add single         FRT <- round_single          fadds[.] FRT, FRA, FRB
precision                         ( (FRA) + (FRB) )
floating-point add double         FRT <- round_double          fadd[.] FRT, FRA, FRB
precision                         ( (FRA) + (FRB) )
floating-point subtract single    FRT <- round_single          fsubs[.] FRT, FRA, FRB
precision                         ( (FRA) - (FRB) )
floating-point subtract double    FRT <- round_double          fsub[.] FRT, FRA, FRB
precision                         ( (FRA) - (FRB) )


a. round_single(A) is a function which rounds the infinite precision number A to a single precision
number (see Appendix D, “A Detailed Floating-point Model”).


b. round_double(A) is a function which rounds the infinite precision number A to a double
precision number.


c. The [.] in the mnemonic indicates that the instruction can be coded with a dot if condition
register field 1 should be updated, or without a dot if condition register field 1 should not be
updated.
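

The effect of the target precision can be seen with a small model. The Python sketch below is
an approximation under stated assumptions, not the architected algorithm: it assumes
round-to-nearest mode, and it rounds the double precision sum a second time rather than rounding
the infinitely precise sum once, which can differ in rare corner cases. The function names
mirror the mnemonics but are hypothetical.

```python
import struct


def round_single(x: float) -> float:
    """Round a double to the nearest single, returned as a double.

    struct.pack raises OverflowError if x is finite but too large for a
    single; that case is not handled in this sketch.
    """
    return struct.unpack(">f", struct.pack(">f", x))[0]


def fadd(fra: float, frb: float) -> float:
    return fra + frb                    # result rounded to double precision


def fadds(fra: float, frb: float) -> float:
    return round_single(fra + frb)      # result rounded to single precision


tiny = 2.0 ** -30
print(fadd(1.0, tiny) == 1.0)    # the double keeps the small addend
print(fadds(1.0, tiny) == 1.0)   # the single rounds it away
```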





Floating-Point Multiply and Divide Instructions 


There are two multiply and two divide instructions, which are analogous to the add and sub- 
tract instructions. That is, there is a floating-point multiply single (fmuls[.]) and a floating-point 
multiply double (fmul[.]) instruction. The form of the multiply instructions follows: 


multiply FRT, FRA, FRC 


These instructions multiply the contents of floating-point registers FRA and FRC, placing the
result into FRT. Similarly, there is a floating-point divide single (fdivs[.]) and a
floating-point divide double (fdiv[.]) instruction. The form of these instructions follows:


divide FRT, FRA, FRB 


For the divide instructions, the contents of FRA are divided by the contents of FRB, and the 
result is placed into FRT. Table 5.5 lists the floating-point multiply and divide instructions. 


Table 5.5. Floating-point multiply and divide instructions.


Instruction                       Definition                   Format
floating-point multiply single    FRT <- round_single          fmuls[.] FRT, FRA, FRC
precision                         ( (FRA) × (FRC) )
floating-point multiply           FRT <- round_double          fmul[.] FRT, FRA, FRC
double precision                  ( (FRA) × (FRC) )
floating-point divide single      FRT <- round_single          fdivs[.] FRT, FRA, FRB
precision                         ( (FRA) / (FRB) )
floating-point divide double      FRT <- round_double          fdiv[.] FRT, FRA, FRB
precision                         ( (FRA) / (FRB) )


a. round_single(A) is a function which rounds the infinite precision number A to a single precision
number (see Appendix D).


b. round_double(A) is a function which rounds the infinite precision number A to a double
precision number.


c. The [.] in the mnemonic indicates that the instruction can be coded with a dot if condition
register field 1 should be updated, or without a dot if condition register field 1 should not be
updated.





Combined Multiply and Add Instructions 


The PowerPC architecture supports a set of floating-point instructions that combine multiply 
and add operations. The general form of these instructions follows: 


multiply-add FRT, FRA, FRC, FRB 


The contents of FRA and FRC are multiplied together. The result of the multiplication then 
is added to the contents of FRB. The result of this whole operation then is placed into FRT. 


There are eight multiply-add instructions. The floating-point multiply add single (fmadds[.])
and floating-point multiply add double (fmadd[.]) instructions perform the exact function
described in the preceding example. The floating-point multiply subtract single (fmsubs[.]) and
floating-point multiply subtract double (fmsub[.]) instructions store the result of FRA times
FRC minus FRB into FRT. The last four multiply-add instructions are analogous to the first four,
except that the final result is negated before being placed into FRT. These instructions are the
floating-point negative multiply add single (fnmadds[.]), floating-point negative multiply add
double (fnmadd[.]), floating-point negative multiply subtract single (fnmsubs[.]), and
floating-point negative multiply subtract double (fnmsub[.]) instructions. Table 5.6 lists the
floating-point multiply accumulate instructions.


Table 5.6. Floating-point multiply accumulate instructions.


Instruction                    Definition                              Format
floating-point multiply-add    FRT <- round_single(                    fmadds[.] FRT, FRA, FRC, FRB
single precision               (FRA)×(FRC) + (FRB) )
floating-point multiply-add    FRT <- round_double(                    fmadd[.] FRT, FRA, FRC, FRB
double precision               (FRA)×(FRC) + (FRB) )
floating-point multiply-       FRT <- round_single(                    fmsubs[.] FRT, FRA, FRC, FRB
subtract single precision      (FRA)×(FRC) - (FRB) )
floating-point multiply-       FRT <- round_double(                    fmsub[.] FRT, FRA, FRC, FRB
subtract double precision      (FRA)×(FRC) - (FRB) )
floating-point negative        FRT <- round_single(                    fnmadds[.] FRT, FRA, FRC, FRB
multiply-add single            0 - ((FRA)×(FRC) + (FRB)) )
precision
floating-point negative        FRT <- round_double(                    fnmadd[.] FRT, FRA, FRC, FRB
multiply-add double            0 - ((FRA)×(FRC) + (FRB)) )
precision
floating-point negative        FRT <- round_single(                    fnmsubs[.] FRT, FRA, FRC, FRB
multiply-subtract single       0 - ((FRA)×(FRC) - (FRB)) )
precision
floating-point negative        FRT <- round_double(                    fnmsub[.] FRT, FRA, FRC, FRB
multiply-subtract double       0 - ((FRA)×(FRC) - (FRB)) )
precision


a. round_single(A) is a function which rounds the infinite precision number A to a single precision
number (see Appendix D).


b. round_double(A) is a function which rounds the infinite precision number A to a double
precision number.


c. The [.] in the mnemonic indicates that the instruction can be coded with a dot if condition
register field 1 should be updated, or without a dot if condition register field 1 should not be
updated.
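

Because the product and the sum are formed with infinite precision and rounded only once, a
combined multiply-add can give a different answer from a separate multiply followed by an add.
The Python sketch below illustrates this for the double precision forms. It is a model under
stated assumptions: the function names are hypothetical, exact rational arithmetic stands in
for the infinitely precise intermediate result, round-to-nearest is assumed, and only finite
operands are handled.

```python
from fractions import Fraction


def fmadd(fra: float, frc: float, frb: float) -> float:
    """(FRA)x(FRC) + (FRB), rounded once to double precision."""
    exact = Fraction(fra) * Fraction(frc) + Fraction(frb)
    return float(exact)     # converting a Fraction rounds to nearest


def fmsub(fra: float, frc: float, frb: float) -> float:
    """(FRA)x(FRC) - (FRB), rounded once to double precision."""
    exact = Fraction(fra) * Fraction(frc) - Fraction(frb)
    return float(exact)


def fnmadd(fra: float, frc: float, frb: float) -> float:
    return -fmadd(fra, frc, frb)


def fnmsub(fra: float, frc: float, frb: float) -> float:
    return -fmsub(fra, frc, frb)


a = 1.0 + 2.0 ** -27
c = 1.0 - 2.0 ** -27
# The exact product is 1 - 2**-54, which a separate multiply rounds to 1.0.
print(a * c - 1.0)          # multiply, round, then subtract
print(fmsub(a, c, 1.0))     # combined operation, rounded once
```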


Floating-Point Compare Instructions, Examples 


There are two floating-point compare instructions. These instructions are used to set a condi- 
tion register field based on a comparison of the contents of two general floating-point registers. 
The condition register field then can be examined by branch instructions to direct program 
flow. The form of the floating-point compare instructions follows: 


compare BF, FRA, FRB 


The two floating-point compare instructions update the 4-bit condition register field in the 
same way: 


■ Bit 0 is set if (FRA) < (FRB).
■ Bit 1 is set if (FRA) > (FRB).
■ Bit 2 is set if (FRA) = (FRB).
■ Bit 3 is set if (FRA) or (FRB) is Not a Number (NaN).


Both floating-point compare instructions also put the results of the comparison into the
floating-point condition code bits of the floating-point status and control register. The
difference between these two instructions is in how the floating-point status and control
register is updated for quiet and signaling NaNs (see Appendix D). The two instructions are the
floating-point compare unordered (fcmpu) and floating-point compare ordered (fcmpo)
instructions. Table 5.7 lists the floating-point compare instructions.


Table 5.7. Floating-point compare instructions.


Instruction                    Definition                                   Format
floating-point compare         CR[BF] <- ((FRA) < (FRB)) ||                 fcmpu BF, FRA, FRB
unordered                      ((FRA) > (FRB)) ||
                               ((FRA) == (FRB)) ||
                               ((FRA) is a NaN or (FRB) is a NaN)
floating-point compare         CR[BF] <- ((FRA) < (FRB)) ||                 fcmpo BF, FRA, FRB
ordered                        ((FRA) > (FRB)) ||
                               ((FRA) == (FRB)) ||
                               ((FRA) is a NaN or (FRB) is a NaN)
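

The 4-bit result that both compare instructions place in the condition register field can be
sketched as follows. This Python model is an illustration only; the function name is
hypothetical, and it does not model the FPSCR updates that distinguish fcmpu from fcmpo. Bit 0
of the field is the most significant of the four bits.

```python
import math


def fcmp_field(fra: float, frb: float) -> int:
    """Return the 4-bit CR field produced by a floating-point compare."""
    if math.isnan(fra) or math.isnan(frb):
        return 0b0001       # bit 3: at least one operand is a NaN
    if fra < frb:
        return 0b1000       # bit 0: less than
    if fra > frb:
        return 0b0100       # bit 1: greater than
    return 0b0010           # bit 2: equal


print(fcmp_field(1.0, 2.0), fcmp_field(2.0, 1.0),
      fcmp_field(1.0, 1.0), fcmp_field(float("nan"), 1.0))
```

Exactly one of the four bits is set for any pair of operands, so a branch can test whichever
single bit it needs.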


Floating-Point Rounding and Conversion 
Instructions 


The PowerPC architecture defines a set of floating-point instructions that can be used to round 
floating-point numbers and to convert between integer and floating-point formats in the gen- 
eral floating-point registers. 


Floating-Point Rounding Instructions 


The floating-point round to single instruction (frsp[.]) rounds the contents of its one general float- 
ing-point register operand (FRB) to single precision, using the rounding mode specified in the 
floating-point status and control register. If the contents of FRB already are in single precision 
format, then they are left the same. The result of the rounding is placed into the target register 
(FRT). This is the only floating-point rounding instruction defined in the PowerPC architec- 
ture (see Table 5.8). 
Table 5.8. Floating-point rounding instruction.


Instruction                       Definition                   Format
floating-point round to single    FRT <- round_single          frsp[.] FRT, FRB
                                  ((FRB))


a. round_single(A) is a function which rounds the infinite precision number A to a single precision
number (see Appendix D).


b. The [.] in the mnemonic indicates that the instruction can be coded with a dot if condition
register field 1 should be updated, or without a dot if condition register field 1 should not be
updated.





Floating-Point Conversion Instructions 


The floating-point conversion instructions convert values in the general floating-point regis- 
ters from integers to floating-point format and from floating-point format to integer format. 
These instructions are the assembly language equivalent of casting an integer to a floating-point 
number or vice versa. There are no instructions for moving the contents of a floating-point 
register into an integer register, so this must be done through memory using loads and stores. 
The general form of the conversion instructions follows: 


convert FRT, FRB 


The contents of FRB are converted and placed into FRT. Any numbers greater than the largest
representable integer are set to the maximum representable integer, and any numbers less than
the smallest (most negative) representable integer are set to the smallest representable
integer.


There are four instructions that convert floating-point numbers to integers. The first two in- 
structions are available on any implementation of the PowerPC architecture. The floating-point 
convert to integer word instruction (fctiw[.]) converts the operand in FRB to a 32-bit integer, 
using the rounding mode specified in the floating-point status and control register. The 
floating-point convert to integer word with round to zero instruction (fctiwz[.]) converts the op- 
erand in FRB to an integer word using round toward zero, regardless of what rounding mode 
is specified in the floating-point status and control register. Table 5.9 lists the 32-bit floating- 
point conversion instructions. 


Table 5.9. 32-bit floating-point conversion instructions.


Instruction                    Definition (a)                               Format (b)
floating-point convert to      FRT[32..63] <- convert_to_integer(           fctiw[.] FRT, FRB
integer word                   (FRB), 32, FPSCR[RN])
                               FRT[0..31] <- undefined
floating-point convert to      FRT[32..63] <- convert_to_integer(           fctiwz[.] FRT, FRB
integer word with round        (FRB), 32, 0b01)
to zero                        FRT[0..31] <- undefined


a. convert_to_integer (A,B,C) is a function that converts the floating-point number A into a B-bit
integer using rounding specified by C (see Appendix D).


b. The [.] in the mnemonic indicates that the instruction can be coded with a dot if condition
register field 1 should be updated, or without a dot if condition register field 1 should not be
updated.
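

The saturating behavior of the conversions can be sketched in a few lines. The Python model
below is an illustration under stated assumptions: the function names are hypothetical, the
rounding argument is reduced to a flag choosing between round toward zero and round to nearest
(the only two modes modeled here), and NaN operands are deliberately left out.

```python
import math


def convert_to_integer(value: float, bits: int, toward_zero: bool) -> int:
    """Convert a double to a signed `bits`-bit integer, saturating at the ends."""
    if math.isnan(value):
        raise ValueError("NaN operands are not covered by this sketch")
    largest = (1 << (bits - 1)) - 1
    smallest = -(1 << (bits - 1))
    if value >= largest:            # also catches positive infinity
        return largest
    if value <= smallest:           # also catches negative infinity
        return smallest
    # round() on a float rounds to nearest, with ties going to even.
    result = math.trunc(value) if toward_zero else round(value)
    return max(smallest, min(largest, result))


def fctiw(frb: float) -> int:
    # Assumes the FPSCR currently selects round to nearest.
    return convert_to_integer(frb, 32, toward_zero=False)


def fctiwz(frb: float) -> int:
    return convert_to_integer(frb, 32, toward_zero=True)


print(fctiw(2.5), fctiwz(2.7), fctiwz(-2.7), fctiwz(1e20), fctiwz(-1e20))
```

Passing 64 for the bits argument models the doubleword conversions in the same way.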





The second two convert to integer instructions are available only on 64-bit implementations 
of the PowerPC architecture. The floating-point convert to integer doubleword instruction (fctid[.]) 
converts the operand in FRB to a 64-bit integer using the rounding mode specified in the floating- 
point status and control register. The floating-point convert to integer doubleword with round to 
zero instruction (fctidz[.]) converts the operand in FRB to an integer doubleword using round 
toward zero, regardless of what rounding mode is specified in the floating-point status and 
control register. 


There is one instruction that converts from integer to floating-point format; it is available only 
on 64-bit implementations of the PowerPC architecture. The floating-point convert from inte- 
ger doubleword instruction (fcfid[.]) converts an integer value in floating-point register FRB to 
a double precision floating-point number. Table 5.10 lists the 64-bit floating-point conver- 
sion instructions. 


Table 5.10. 64-bit floating-point conversion instructions. 


Instruction Definition Format 
floating-point convert to FRT <- convert_to_ fctid[.] FRT,FRB 
integer double word integer*((FRB),64, FPSCR,,..) 
floating-point convert to FRT <- convert_to_ fctid[.] FRT,FRB 
integer doubleword with integer((FRB),64, 0b01) 
round to zero 
floating-point convert from FRT <- round_double fcfid[.] FRT,FRB 
integer doubleword ( convert_from_integer? 

((FRB), 64) 


a. convert_to_integer (A,B,C) is a function that converts the floating-point number A into a B-bit 
integer using rounding specified by C (see Appendix D). 


b. convert_from_integer (A,B) is a function that converts the 64-bit integer in A into a fully precise 
(64-bit mantissa) floating-point number (see Appendix D). 


c. The [.] in the mnemonic indicates that the instruction can be coded with a dot if condition 
register field 1 should be updated, or without a dot if condition register field 1 should not be 
updated. 


Status and Control Registers, Examples 


There are four instructions that can be used to alter fields in the floating-point status and con- 
trol register (FPSCR). The first two instructions that are used to change the FPSCR are the 
move to FPSCR field immediate (mtfsfi[.]) and the move to FPSCR fields (mtfsf[.]) instructions.
The mtfsfi[.] instruction copies a 4-bit immediate field into one of eight 4-bit fields in the
FPSCR. The form of this instruction follows: 


mtfsfi[.] BF, U 


The U field is the 4-bit immediate value, and the BF field specifies which 4-bit field in the 
FPSCR should be updated. If BF is equal to 0, then the update is handled differently from the 
other cases. In this case, the floating-point exception summary bit (FX; bit 0) is set to the value 
of bit 0 of the U field, and the floating-point overflow exception bit (OX; bit 3) is set to the 
value of bit 3 in the U field. The floating-point enabled exception summary bit (FEX; bit 1) 
and the floating-point invalid operation exception summary bit (VX; bit 2) are set by the usual 
rule (see Appendix D), rather than to the values contained in bits 1 and 2 of the U field. 
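

The field selection can be sketched in C on a 32-bit image of the FPSCR. The function names are hypothetical, and the sketch deliberately does not model the recomputation of FEX and VX (it simply leaves those two bits unchanged); it only shows where the U value lands and why field 0 is special.

```c
#include <assert.h>
#include <stdint.h>

/* FPSCR bits are numbered from 0 at the most significant end, so the
   4-bit field BF occupies bits 4*BF through 4*BF+3. */
static uint32_t fpscr_field_shift(unsigned bf)
{
    assert(bf < 8);
    return 28u - 4u * bf;
}

/* Sketch of mtfsfi BF,U. Assumption: FEX and VX recomputation is NOT
   modeled; for BF = 0 those bits are left as they were. */
static uint32_t model_mtfsfi(uint32_t fpscr, unsigned bf, unsigned u)
{
    assert(u < 16);
    uint32_t mask = 0xFu << fpscr_field_shift(bf);
    if (bf == 0)
        mask = 0x90000000u;  /* only FX (bit 0) and OX (bit 3) come from U */
    uint32_t value = (uint32_t)u << fpscr_field_shift(bf);
    return (fpscr & ~mask) | (value & mask);
}
```

With this model, writing U = 0b1111 to field 0 of a cleared FPSCR sets only FX and OX, whereas writing to field 7 replaces the four low-order bits.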


The mtfsf[.] instruction copies the low-order 32 bits from a general floating-point register into 
the FPSCR under control of an 8-bit mask contained in an immediate field in the instruction. 
The form of the mtfsf[.] instruction follows:


mtfsf[.] FLM, FRB 


The FLM field is an 8-bit mask that specifies which of the eight 4-bit fields are to be updated 
from the contents of the source register—FRB. If field 0 is specified (bit 0 of the FLM field is 
a 1), then field 0 is updated similarly to the mtfsfi[.] instruction. That is, the FX and OX bits
are set from the contents of FRB, while the FEX and VX bits are set using the normal rule (see 
Appendix D). 
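

One way to picture the mask is to expand the 8-bit FLM into a 32-bit bit mask, as in the C sketch below. The names are illustrative only, frb_low stands for the low-order 32 bits of FRB, and the "normal rule" for FEX and VX is not modeled: those two bits are simply never taken from FRB.

```c
#include <assert.h>
#include <stdint.h>

/* Expand FLM into a 32-bit mask. FLM bit i (bit 0 is the most significant
   bit of the 8-bit field) selects 4-bit FPSCR field i. */
static uint32_t flm_to_mask(unsigned flm)
{
    assert(flm < 256);
    uint32_t mask = 0;
    for (unsigned field = 0; field < 8; field++) {
        if (flm & (0x80u >> field))
            mask |= 0xFu << (28u - 4u * field);
    }
    return mask;
}

/* Sketch of mtfsf FLM,FRB. Assumption: FEX (bit 1) and VX (bit 2) are
   left unchanged rather than recomputed. */
static uint32_t model_mtfsf(uint32_t fpscr, unsigned flm, uint32_t frb_low)
{
    uint32_t mask = flm_to_mask(flm) & ~0x60000000u;
    return (fpscr & ~mask) | (frb_low & mask);
}
```

An FLM of 0x01 therefore touches only field 7 (the low-order four bits), and an FLM of 0xFF copies everything except FEX and VX.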


The second two instructions that are used to set the FPSCR are used to set a single bit to a 1 or 
a 0. The general form of these instructions follows: 


set BT 


The BT field is a 5-bit field that specifies which bit of the FPSCR is to be set, while the value
to which it should be set is specified by the instruction. The reset FPSCR bit instruction
(mtfsb0[.]) is used to set bit BT to 0; the set FPSCR bit instruction (mtfsb1[.]) is used to set bit
BT of the FPSCR to 1. Neither of these instructions can be used to set or reset the FEX or VX
bits (bits 1 and 2) of the FPSCR.
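

Because FPSCR bits are numbered from the most significant end, bit BT corresponds to the C mask 0x80000000 >> BT. The sketch below uses hypothetical function names and treats an attempt to change FEX or VX as having no effect; any side effects on the summary bits (see Appendix D) are not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Mask for FPSCR bit BT, where bit 0 is the most significant bit. */
static uint32_t fpscr_bit(unsigned bt)
{
    assert(bt < 32);
    return 0x80000000u >> bt;
}

/* Sketch of mtfsb1 BT: set one bit, except FEX (1) and VX (2). */
static uint32_t model_mtfsb1(uint32_t fpscr, unsigned bt)
{
    if (bt == 1 || bt == 2)
        return fpscr;
    return fpscr | fpscr_bit(bt);
}

/* Sketch of mtfsb0 BT: clear one bit, except FEX (1) and VX (2). */
static uint32_t model_mtfsb0(uint32_t fpscr, unsigned bt)
{
    if (bt == 1 || bt == 2)
        return fpscr;
    return fpscr & ~fpscr_bit(bt);
}
```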


Two instructions enable the programmer to read the FPSCR. The move from FPSCR instruc- 
tion (mffs[.]) copies the contents of the FPSCR into the low-order 32 bits of a floating-point 
target register. The move to condition register from FPSCR instruction (mcrfs) copies a 4-bit field 
from the FPSCR into one of eight 4-bit fields in the condition register. The form of the mcrfs 
instruction follows: 


mcrfs BF, BFA 


The 3-bit BFA field specifies which field in the FPSCR should be moved, and the 3-bit BF
field specifies to which field in the condition register the FPSCR field should be moved. When
an exception bit in the FPSCR is copied into the condition register using this instruction, it
is set to 0 in the FPSCR, except for the FEX and VX bits (bits 1 and 2), which are updated in the normal way
(see Appendix D). Table 5.11 lists the floating-point status and control register instructions.
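

The data movement performed by mcrfs can be sketched in C as a field extract followed by a field insert. The function name is hypothetical, both registers are treated as 32-bit images with field 0 at the most significant end, and the clearing of copied exception bits in the FPSCR is not modeled here.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of mcrfs BF,BFA: returns the new condition register image.
   Assumption: the FPSCR side effect (clearing copied exception bits)
   is left out of this model. */
static uint32_t model_mcrfs(uint32_t cr, uint32_t fpscr,
                            unsigned bf, unsigned bfa)
{
    assert(bf < 8 && bfa < 8);
    uint32_t field = (fpscr >> (28u - 4u * bfa)) & 0xFu;  /* FPSCR field BFA */
    uint32_t shift = 28u - 4u * bf;                       /* CR field BF */
    return (cr & ~(0xFu << shift)) | (field << shift);
}
```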


Table 5.11. Floating-point status and control register instructions. 





Instruction            Definition                                          Format (a)

move to condition      CR[BF] <- FPSCR[BFA] (b)                            mcrfs BF,BFA
register from FPSCR

move from FPSCR        FRT[32:63] <- FPSCR                                 mffs[.] FRT
                       FRT[0:31] <- undefined

move to FPSCR          FPSCR[BF] <- (BF != 0) ? U[0:3] :                   mtfsfi[.] BF,U
field immediate                     U[0] || FEX || VX || U[3]

move to FPSCR          FPSCR <-                                            mtfsf[.] FLM,FRB
fields                 ((FLM[0] & FRB[32]) | (¬FLM[0] & FPSCR[0])) ||
                       FEX || VX ||
                       ((FLM[0] & FRB[35]) | (¬FLM[0] & FPSCR[3])) ||
                       ((FLM[1] & FRB[36:39]) | (¬FLM[1] & FPSCR[4:7])) ||
                       ((FLM[2] & FRB[40:43]) | (¬FLM[2] & FPSCR[8:11])) ||
                       ((FLM[3] & FRB[44:47]) | (¬FLM[3] & FPSCR[12:15])) ||
                       ((FLM[4] & FRB[48:51]) | (¬FLM[4] & FPSCR[16:19])) ||
                       ((FLM[5] & FRB[52:55]) | (¬FLM[5] & FPSCR[20:23])) ||
                       ((FLM[6] & FRB[56:59]) | (¬FLM[6] & FPSCR[24:27])) ||
                       ((FLM[7] & FRB[60:63]) | (¬FLM[7] & FPSCR[28:31]))

reset FPSCR bit        FPSCR[BT] <- 0b0                                    mtfsb0[.] BT

set FPSCR bit          FPSCR[BT] <- 0b1                                    mtfsb1[.] BT


a. The [.] in the mnemonic indicates that the instruction can be coded with a dot if condition
register field 1 should be updated, or without a dot if condition register field 1 should not be
updated.


b. Note that FPSCR[BFA] indicates the 4-bit field within the FPSCR specified by BFA. Thus for a
BFA value of 0, FPSCR[BFA] = FPSCR[0:3] = FPSCR[FX, FEX, VX, OX].


While an assembly language program has much more freedom to define its own conventions
for interfacing between its internal functions, when that code has to interface with other rou- 
tines certain standard conventions must be followed. This chapter covers the details of those 
conventions for making function calls, both within your program and to system libraries or 
compiled code. It also covers object module formats and the impact of dynamic binding on 
making function calls and writing functions. 


Since so many operating systems are being developed for the PowerPC microprocessors, some 
of this information may not apply to your system. At the time of this writing this information 
is accurate for all the PowerPC platforms. However, you should consult your system docu- 
mentation for more details about your specific platform. 


Subroutine Linkage Conventions 


The current PowerPC development tools follow certain conventions in the programming of 
the PowerPC processors. These conventions are similar in spirit to the conventions used on 
the 680x0 based Macintosh or the x86 DOS conventions, but are specific to the PowerPC. 


Architecture 


These conventions are set by the application binary interface (ABI), not by the PowerPC Ar- 
chitecture. Each operating system may have its own ABI, or may share a common ABI with 
other operating systems. The ABI illustrated in this chapter is the PowerOpen ABI, which is 
used by all of the current PowerPC development tools. These conventions are usually known 
as subroutine linkage conventions because they mostly address the programming of function calls. 


The basic goal of the linkage conventions is to allow a function to make assumptions about the 
state of the processor after calling or being called by another function. The other major goal is 
the fast execution of function calls, achieved mostly by minimizing the amount of interaction 
with memory during a function call, especially for argument passing. 


POWERPC FUNCTION CALLS 


Register Usage Conventions


The register usage conventions specify which registers are dedicated to certain uses and which 
registers must have their contents left unchanged by a called function. Those registers whose 
contents may be changed by a function are known as volatile registers, and those whose con- 
tents may not be changed are known as nonvolatile registers. This doesn’t mean a nonvolatile 
register cannot be changed by a function, but rather that if a function needs to use a nonvola- 
tile register, it must save the contents before the register is modified and restore the original 
contents before returning to the calling function. That is, it is the called function’s responsibil- 
ity to save and restore nonvolatile register contents, not the calling function’s responsibility. 
This is more efficient because the author of each function knows which registers it needs, and 
can save and restore only those registers, whereas, if the calling function was responsible, it could 
be forced to make worst-case assumptions about the behavior of other functions, which would 
result in unnecessary saving and restoring of registers that are not actually changed by the called 
function. 


GPR Usage Conventions 


The register usage conventions for the general purpose registers are shown in Table 6.1. There 
are two dedicated registers, GPR 1, used as the stack pointer (SP), and GPR 2, used as the 
TOC pointer (RTOC). For the other volatile GPRs, the table shows a role each register may 
play during a function call. Notice that since GPRs 0 and 3 - 12 are volatile, if a function can
be written to use only these registers in its computations, no registers need to be saved or re-
stored. Also, since GPRs 13 - 31 are nonvolatile, there are essentially 19 words of storage for
intermediate values available that can be used when a function needs to call another function,
so a function can keep values live across a call without relying on the volatile registers. This will improve
the performance of a function that needs to preserve values when making multiple function
calls because the nonvolatile register values can be stored to memory once and the registers
used freely, instead of storing intermediate values to memory before each function call.


Table 6.1. General-purpose register conventions. 


Register          Status        Use

GPR 0             Volatile      In function prologs
GPR 1             Dedicated     Stack pointer (SP)
GPR 2             Dedicated     TOC pointer (RTOC)
GPR 3             Volatile      Argument word 1; Return value word 1
GPR 4             Volatile      Argument word 2; Return value word 2
GPR 5 - GPR 10    Volatile      Argument word 3 - argument word 8
GPR 11            Volatile      In calls by pointer; Environment pointer in some
                                programming languages (for example, PASCAL)
GPR 12            Volatile      In glink code; In exception handling code in some
                                programming languages
GPR 13 - GPR 31   Nonvolatile   Must be preserved across function calls
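

The GPR conventions in Table 6.1 can be captured in a small C lookup, which may be handy when checking hand-written assembly. This is only a sketch; the type and function names are invented for illustration.

```c
#include <assert.h>

typedef enum { REG_VOLATILE, REG_DEDICATED, REG_NONVOLATILE } RegStatus;

/* Status of a GPR under the conventions of Table 6.1. */
static RegStatus gpr_status(unsigned gpr)
{
    assert(gpr < 32);
    if (gpr == 1 || gpr == 2)
        return REG_DEDICATED;      /* SP and RTOC */
    if (gpr >= 13)
        return REG_NONVOLATILE;    /* GPR 13 - GPR 31 */
    return REG_VOLATILE;           /* GPR 0 and GPR 3 - GPR 12 */
}

/* Argument word (1 to 8) carried by a GPR, or 0 if it carries none. */
static unsigned gpr_argument_word(unsigned gpr)
{
    assert(gpr < 32);
    return (gpr >= 3 && gpr <= 10) ? gpr - 2 : 0;
}
```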


When using the nonvolatile GPRs, you should use them starting with GPR 31 and working 
your way backward, rather than starting with GPR 13. This is historically related to the 
load-and-store multiple instructions from the POWER architecture, which load or store all 
registers from a specified register through register 31. In the PowerPC processors, however, the 
load-and-store multiple instructions can inflict serious performance penalties, so the preferred 
method of saving and restoring the GPRs is to use a series of load or store word instructions. 
Allocating nonvolatile GPRs starting with GPR 31 is still useful then, because a single routine 
performing load or store word instructions can be written with an entry-point at each instruc- 
tion, corresponding to the lowest numbered nonvolatile GPR used. Using this subroutine can 
result in smaller code, since potentially large numbers of load-and-store instructions can be 
omitted from each function call. 


FPR Usage Conventions 


The usage conventions for the floating-point registers are shown in Table 6.2. There are no 
floating-point registers with dedicated purposes. Floating-point registers can be used to pass
arguments and return values during a function call, much like the GPRs; however, only
floating-point values are passed this way. Like the GPRs, the nonvolatile FPRs should be used
starting with FPR 31 and working backwards.


Table 6.2. Floating-point register conventions. 


Register Status Use 

FPR 0 Volatile Scratch

FPR 1 Volatile FP argument 1; Return FP value 1 
FPR 2 Volatile FP argument 2; Return FP value 2 
FPR 3 Volatile FP argument 3; Return FP value 3 
FPR 4 Volatile FP argument 4; Return FP value 4 
FPR 5 - FPR 13 Volatile FP argument 5 - FP argument 13 


FPR 14 - FPR 31 Nonvolatile Must be preserved across function calls 


SPR Usage Conventions


The conventions for using the special purpose registers are shown in Table 6.3. Most of these 
registers are not explicitly used during a function call and are made volatile because their values 
are usually not useful for very long. The only nonvolatile values are the condition register fields 
CR2 - CR4. Since these represent only part of the condition register, to preserve these fields 
you must save and restore the entire condition register. If you do not use these fields, then you 
don’t need to save and restore the condition register. There are no usage conventions for the 
SPRs not shown in Table 6.3, as they are usually manipulated only by custom assembly code, 
typically in the operating system. Compilers will not normally generate any code to manipu- 
late them. 


Table 6.3. Special-purpose register conventions. 


Register          Status        Use

LR                Volatile      For branching and returning
CTR               Volatile      For branching and loop counts
XER               Volatile      Fixed-point status register
FPSCR             Volatile      Floating-point status register
CR0, CR1          Volatile      Scratch condition register fields
CR2, CR3, CR4     Nonvolatile   Must be preserved across function calls
CR5, CR6, CR7     Volatile      Scratch condition register fields


Stack Usage Conventions 


Like the x86 and 680x0 processors, the PowerPC programming model utilizes a stack to store 
information associated with a function, such as local (automatic) variables, and as a way for 
functions to transfer data during a function call, such as function arguments. The stack is a 
last-in first-out (LIFO) data structure, meaning that the last thing you put into a stack is the 
first thing out. In the context of making function calls, the elements that are being put (or 
pushed) onto the stack are called stack frames, which contain information for the function that 
is being called. When a function returns, its stack frame is removed (or popped) from the stack, 
so the currently executing function always has its stack frame at the top of the stack. 


As is the convention in most computers, the PowerPC stack grows from higher addresses 
towards lower addresses. This can be confusing because the end of the stack with the lowest 
address is actually referred to as the top of the stack. The stack pointer (SP), GPR 1 by conven- 
tion, always contains the address of the top of the stack, or more precisely, the topmost (lowest 
addressed) word on the stack. When a stack frame is pushed onto the stack, the SP is decremented
by the size of the stack frame, and when a stack frame is popped, the SP is incremented
accordingly.


Stack Alignment 


The top of the stack is required to be quadword aligned; that is, the value in SP must be a
multiple of sixteen. Thus each stack frame must be a multiple of sixteen bytes in size. In order
to maintain this requirement, each stack frame must include zero to fifteen bytes of alignment
padding.
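

The padding follows from rounding the unpadded frame size up to the next multiple of sixteen. A minimal C sketch, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Round a byte count up to the next multiple of 16 (a quadword). */
static uint32_t round_up_to_quadword(uint32_t nbytes)
{
    assert(nbytes <= UINT32_MAX - 15u);  /* avoid wrap-around */
    return (nbytes + 15u) & ~(uint32_t)15u;
}

/* Number of alignment padding bytes (0 to 15) a frame of nbytes needs. */
static uint32_t alignment_padding(uint32_t nbytes)
{
    return round_up_to_quadword(nbytes) - nbytes;
}
```

For instance, a frame whose areas total 56 bytes needs 8 bytes of padding, and one that already totals 64 bytes needs none.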


Stack Frame Layout 


The structure of a stack frame for a single function is shown in Figure 6.1. The figure shows 
the various areas of the stack frame, their offsets relative to the SP, the areas that are optional, 
and the areas that are actually used by the function associated with the stack frame. 


FIGURE 6.1. Layout of a function’s stack frame. Grayed areas are not used by this function.
Addresses are shown relative to the stack pointer from which they are addressed.


Each area of the stack frame is described in detail below.


Link Area 


The link area of the stack frame contains the fields used most often during function calls. The
first field contains the address of the top of the caller’s stack frame; that is, the value in SP when
a function is called is stored here before SP is changed. This field is sometimes called the back
chain, as it allows you to find each stack frame by tracing the pointers back through the stack.
The remainder of the link area is used by any functions called from this function, and so should
be considered volatile across these function calls. There are two words reserved for use by the
compiler and the binder, and the other three words are used by any called function to store the
value of the link register, the condition register, and the RTOC (GPR 2). Note that the current
function uses the corresponding fields in its caller’s stack frame, though those fields are
not shown in the figure. The TOC save area is usually only used when glink (global linkage)
helper functions are used to complete a function call.


The link area is required to be in a stack frame and is always exactly 24 bytes long. The link 
area is addressed in the context of the calling function, that is when SP points to the top of the 
caller's stack frame, and begins at SP + 0. Functions use the link area of their callers during 
function entry and exit. 
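

The six words of the link area can be pictured as a C structure. The back chain at SP + 0, the CR save word at SP + 4, and the LR save word at SP + 8 follow from the text; the positions of the two reserved words and of the TOC save word are an assumption made for this sketch, since the figure is not reproduced here. All names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t back_chain;         /* SP + 0:  top of the caller's frame */
    uint32_t saved_cr;           /* SP + 4:  condition register */
    uint32_t saved_lr;           /* SP + 8:  link register */
    uint32_t reserved_compiler;  /* SP + 12: assumed position */
    uint32_t reserved_binder;    /* SP + 16: assumed position */
    uint32_t saved_toc;          /* SP + 20: assumed position of RTOC save */
} LinkArea;

/* Address of a link-area field, given the SP that addresses the area. */
static uint32_t link_field_address(uint32_t sp, size_t field_offset)
{
    assert(field_offset < sizeof(LinkArea) && field_offset % 4 == 0);
    return sp + (uint32_t)field_offset;
}
```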


Argument Passing Area 


The argument passing area is used by the current function to pass parameters to any functions 
it may call and to retrieve return values. Space is always allocated for the first eight words of 
function arguments; however, the current function actually passes these arguments via GPRs 3 
through 10. This space is only used by a called function that takes the address of one of its 
parameters. When a called function wants to do this, it first stores the parameter into the argu- 
ment passing area of the parent function’s stack frame, and then takes the address of that loca- 
tion in the stack. This is more efficient because the author of the called function knows whether 
it will need to do this, so the store is done only when needed, rather than having the caller 
always store the first arguments. 


Space beyond the first eight words is added into the argument passing area only if the current 
function will call a function that requires more than eight words of parameters. In that case, 
the extra parameter data is stored here before calling the function. Notice that this implies that 
when creating a function’s stack frame, you must know the maximum number of arguments 
required by any function called. 


The argument passing area is required and is always at least 32 bytes long. It is addressed in the 
context of the function associated with the stack frame, that is when SP points to the top of the 
current function’s stack frame (when the caller is storing arguments), or in the context of the 
calling function, that is when SP points to the top of the stack frame of the function which 
called the current function (when the called function is retrieving arguments). The argument 
passing area begins at SP + 24. 
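

The size of the argument passing area and the location of each argument word follow directly from these rules. A small C sketch, with hypothetical names and argument words counted from 1 as in Table 6.1:

```c
#include <assert.h>
#include <stdint.h>

#define LINK_AREA_BYTES 24u
#define MIN_ARG_WORDS    8u

/* Size of the argument passing area for a function whose most demanding
   callee takes max_arg_words words of parameters. */
static uint32_t arg_area_bytes(uint32_t max_arg_words)
{
    assert(max_arg_words <= 8192u);  /* arbitrary sanity limit for the sketch */
    uint32_t words = max_arg_words > MIN_ARG_WORDS ? max_arg_words
                                                   : MIN_ARG_WORDS;
    return 4u * words;
}

/* Offset from SP of argument word `word` (1-based). */
static uint32_t arg_word_offset(uint32_t word)
{
    assert(word >= 1u);
    return LINK_AREA_BYTES + 4u * (word - 1u);
}
```

Argument word 1 lives at SP + 24 and word 8 at SP + 52; a ninth word, which has no register, would be stored at SP + 56.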


Local Stack Area 


The local stack area is basically a scratch area for the function associated with the stack frame. 
The most common use of this area is for storage of local, or automatic, variables. 


The local stack area is optional and may be any size. It is usually addressed in the context of the 
function associated with the stack frame, that is when SP points to the top of the current 
function’s stack frame, and begins at an offset that depends on the size of the argument passing 
area. 


There are zero to fifteen bytes of alignment padding below the local stack area, added to main-
tain the quadword alignment of SP. 


Register Save Areas 


The register save areas are used to preserve the initial values of any nonvolatile GPRs and FPRs 
that are used by the current function. The GPR save area is above the FPR save area, and within 
each the lowest nonvolatile register used is topmost, and register 31 is bottommost. 


The GPR save area is 0 to 76 bytes long and the FPR save area is 0 to 144 bytes long. These areas
are usually accessed in the context of the calling function, that is while the SP points to the top 
of the stack frame of the function which called the current function. The FPR save area begins 
at SP - 8 * (number of FPRs to save), and the GPR save area begins at SP - 4 * (number of 
GPRs to save) - 8 * (number of FPRs to save). 
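

These offsets can be computed as in the C sketch below. The helper names are hypothetical; offsets are relative to the caller's SP, and the registers are assumed to be allocated downward from register 31, as recommended earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Start of the FPR save area when nfpr nonvolatile FPRs are saved. */
static int32_t fpr_save_area_offset(int nfpr)
{
    assert(nfpr >= 0 && nfpr <= 18);   /* FPR 14 - FPR 31 */
    return -8 * nfpr;
}

/* Start of the GPR save area, which sits above the FPR save area. */
static int32_t gpr_save_area_offset(int ngpr, int nfpr)
{
    assert(ngpr >= 0 && ngpr <= 19);   /* GPR 13 - GPR 31 */
    return fpr_save_area_offset(nfpr) - 4 * ngpr;
}

/* Slot of one FPR: FPR 31 is bottommost, directly below the caller's SP. */
static int32_t fpr_slot_offset(int fpr)
{
    assert(fpr >= 14 && fpr <= 31);
    return -8 * (32 - fpr);
}

/* Slot of one GPR: GPR 31 is bottommost within the GPR save area. */
static int32_t gpr_slot_offset(int gpr, int nfpr)
{
    assert(gpr >= 13 && gpr <= 31);
    return fpr_save_area_offset(nfpr) - 4 * (32 - gpr);
}
```

With every nonvolatile register saved, the GPR save area starts at SP - 220, which is the total of the 76-byte and 144-byte maximums.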


Stack Usage: Function Prolog and Epilog 


The code that performs the pushing and popping of stack frames during a function call is usu- 
ally known as the function prolog and the function epilog. While some parts of the prolog and 
epilog are optional since some parts of the stack frame are optional, the general steps are well 
defined. Each function is responsible for constructing and destroying its own stack frame since 
only the author of a function can know exactly what parts of a stack frame a given function will 
need, or if it will even need a stack frame at all. So the prolog and epilog code is included in
each function: the prolog is performed immediately after the branch to the function, and the
epilog immediately before the returning branch. Both execute in the context of the function’s
caller; that is, the SP should point to the calling function’s
stack frame. Figure 6.2 shows the stack during the execution of the prolog and epilog. Example
6.1 shows sample code illustrating function prolog and epilog, and stack frame construction
and destruction.


Function Prolog 


The function prolog code should be located at the beginning of a function so that it is executed
immediately following the branch to the function. At this time the SP will contain the address of the
calling function’s stack frame. The steps performed by the function prolog are: 


■ If the function will modify the nonvolatile condition register fields (CR2 - CR4), save the
  CR into the space in the function caller’s link area (SP + 4).

■ If the function will modify the link register, save the LR into the space in the function
  caller’s link area (SP + 8). This value is normally the return address, the address to
  branch to when the function is complete.

■ If the function will use any nonvolatile FPRs, save them into the function’s FPR save
  area (SP - 8 * number of FPRs).

■ If the function will use any nonvolatile GPRs, save them into the function’s GPR save
  area (SP - 4 * number of GPRs - 8 * number of FPRs).

■ After storing nonvolatile registers, the function may load any argument data from the
  argument passing area into these registers.


Following the function prolog, the function’s stack frame is constructed. 


FIGURE 6.2. View of the stack during execution of prolog and epilog code. Grayed areas
are not accessed by the prolog or epilog. Some areas shown are optional.


Function Epilog 


The function epilog code should be located at the end of a function so that it will be executed
after the destruction of the current function’s stack frame and immediately before the return 
branch. At this time the SP will contain the address of the calling function’s stack frame. The 
steps performed in the function epilog are: 


■ The values of any nonvolatile FPRs that were saved should be restored from the FPR
  register save area (SP - 8 * number of FPRs).

■ The values of any nonvolatile GPRs that were saved should be restored from the GPR
  register save area (SP - 4 * number of GPRs - 8 * number of FPRs).

■ If the function modified the nonvolatile condition register fields (CR2 - CR4),
  restore the CR from the space in the function caller’s link area (SP + 4).

■ If the function modified the link register, restore the LR from the space in the
  function caller’s link area (SP + 8). This value is normally the return address, the
  address to branch to when the function is complete.


Following the function epilog, the function returns to its caller. 


Creating and Destroying the Stack Frame


The actual creation and destruction of the stack frame is simple. All that you need to do is to 
update the value in the SP to point to the new top of the stack, and in the case of creating a new 
stack frame you will need to store the back chain (the previous value of SP) in the link area. 
Some systems may require that the store of the back chain and the update of SP is done atomi- 
cally (in one instruction), but this requirement is easy to satisfy. The reason for this require- 
ment is that it guarantees that an interrupt cannot occur while the stack is in an invalid state. 


The stack frame creation and store of the back chain can be done with the store-word-with- 
update instruction. This instruction performs the update of SP and the store of the back chain 
atomically. For stack frames smaller than 32KB the immediate form can be used, and if the 
stack frame is larger than 32KB the X-form should be used. 


The destruction of the stack frame can be done two ways. One way is to add the size of the
stack frame to SP, and the other way is to load the back chain into SP. The first method is
faster, but it may be inconvenient if the size of the stack frame may have changed during ex- 
ecution of the function. The second method is slower, but will always work. 


Notice that since the value of SP must be quadword aligned, and that the link area and mini- 
mum argument passing areas are required, the minimum stack frame size is 64 bytes: 24 bytes 
in the link area, 32 bytes in the argument passing area, and 8 bytes of padding. 
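

The total size of a frame can be computed from its parts, as in this C sketch. The function name and parameters are illustrative; the arithmetic simply adds the areas described above and rounds up to a quadword.

```c
#include <assert.h>
#include <stdint.h>

/* Total stack frame size in bytes.
   max_arg_words: largest parameter word count of any function called
   local_bytes:   size of the local stack area
   ngpr, nfpr:    number of nonvolatile GPRs and FPRs saved */
static uint32_t frame_size(uint32_t max_arg_words, uint32_t local_bytes,
                           uint32_t ngpr, uint32_t nfpr)
{
    assert(ngpr <= 19u && nfpr <= 18u);
    assert(max_arg_words <= 4096u && local_bytes <= (1u << 30));
    uint32_t arg_words = max_arg_words > 8u ? max_arg_words : 8u;
    uint32_t raw = 24u                  /* link area */
                 + 4u * arg_words       /* argument passing area */
                 + local_bytes          /* local stack area */
                 + 4u * ngpr            /* GPR save area */
                 + 8u * nfpr;           /* FPR save area */
    return (raw + 15u) & ~(uint32_t)15u;  /* add alignment padding */
}
```

A function with no locals and no saved registers gets the 64-byte minimum; creating its frame decrements SP by that amount.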


EXAMPLE 6.1.


Functions Without Stack Frames: Leaf Functions


Leaf functions, that is, functions that do not call any other functions, do not need to create a 
stack frame at all. The areas in the stack frame of the leaf function’s caller allow a leaf function 
to perform any type of operation any function would, with the exception of making a function 
call. If a leaf function without a stack frame were to call another function, that function would 
treat the stack frame of the leaf function’s caller as its caller’s stack frame, and overwrite any
values that the leaf function may have stored there. A good example is the LR field in the
link area. The leaf function stores its return address there, and then calls another function that 
stores its return address in the same place. When the leaf function attempts to return to its 
caller, it will instead branch back to the location following the function call it made, creating 
an infinite loop through the end of the leaf function. 


In order to allow a leaf function to save and restore nonvolatile registers, a leaf function may 
use the register save area above its caller’s stack frame. Apple’s PowerPC System Software
volume of Inside Macintosh calls this area the Red Zone. The Red Zone is 220 bytes long, since this
is the largest possible register save area, and any space not used for preserving registers can be 
used as a scratch area by the leaf function. 


Argument and Return Value Passing on the Stack


The general layout of parameters during parameter passing is as follows: 


■ The first argument is the leftmost, and each argument smaller than 32 bits is extended
  to 32 bits. Structures are passed without changing their internal data layout.

■ Each word of the resulting data is passed through the GPRs and the argument passing
  area. Note that this includes floating point values. The reason is that C functions may
  take variable numbers of arguments, and this is necessary for those functions to
  process their argument list.

■ Any floating-point values are also passed via the FPRs.


From the above rules it becomes clear that C functions can attain maximum benefit from pass- 
ing arguments in registers by moving all floating-point arguments to the end of their param- 
eter lists. 
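

The effect of argument order can be explored with a small C model of these rules. Everything in it is a sketch under stated assumptions: the type and function names are invented, every non-floating-point argument is taken to be one word, and every floating-point argument is taken to be a double occupying two argument words.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ARG_WORD, ARG_DOUBLE } ArgKind;  /* hypothetical categories */

typedef struct {
    int first_word;    /* 1-based index of the argument's first word */
    int gpr;           /* GPR holding the first word, or -1 if none */
    int fpr;           /* FPR holding the value, or -1 if none */
    int stack_offset;  /* offset of the first word from the caller's SP */
} ArgSlot;

/* Assign each argument its argument words, GPR, FPR, and stack offset. */
static void place_arguments(const ArgKind *kinds, size_t count, ArgSlot *slots)
{
    int next_word = 1;  /* argument words 1 - 8 map to GPR 3 - GPR 10 */
    int next_fpr = 1;   /* FP arguments use FPR 1 - FPR 13 */

    assert(count == 0 || (kinds != NULL && slots != NULL));
    for (size_t i = 0; i < count; i++) {
        int words = (kinds[i] == ARG_DOUBLE) ? 2 : 1;  /* assumed sizes */
        slots[i].first_word = next_word;
        slots[i].gpr = (next_word <= 8) ? next_word + 2 : -1;
        slots[i].stack_offset = 24 + 4 * (next_word - 1);
        slots[i].fpr = -1;
        if (kinds[i] == ARG_DOUBLE && next_fpr <= 13)
            slots[i].fpr = next_fpr++;
        next_word += words;
    }
}
```

In this model, four doubles followed by an integer leave the integer at argument word 9, with no GPR available, whereas placing the integer first puts it in GPR 3 while the doubles still travel in FPR 1 through FPR 4.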


Return values are passed back in a similar manner, that is, for integer type return values, the 
value is passed in GPR 3. For floating point return values, the value is returned in both the 
GPRs and FPRs. An additional special case is when a structure is returned. In this case the 
calling function places the address of an area in its local variable area in GPR 3. The called 
function should return the structure by storing the appropriate data into that area. Note that 
this means that if a function calls another function which returns a structure, it must allocate 
space in its local variable area for that structure. 


Interrupts and the Stack 


When an interrupt handler is invoked, it must not destroy the contents of the stack. At entry 
the handler should assume the following: 


■ SP points to the top of the stack, and the top of the stack contains a valid back chain.
  This should be true because any updates to SP should be atomic.

■ The register save area above the stack should not be touched. This area may be in use
  either by a function prolog that is saving registers, or by a leaf function.


So, if an interrupt handler needs to use the stack, it should perform a store-with-update,
decrementing SP by 224 bytes and creating a valid back chain. The handler should restore SP
before exiting.
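

The 224-byte figure is presumably the 220-byte maximum register save area rounded up to a quadword boundary. The C sketch below, with hypothetical names, shows that arithmetic and the resulting handler SP.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_GPR_SAVE_BYTES (19u * 4u)  /* GPR 13 - GPR 31 */
#define MAX_FPR_SAVE_BYTES (18u * 8u)  /* FPR 14 - FPR 31 */

/* Bytes an interrupt handler must skip to stay clear of the save area. */
static uint32_t interrupt_skip_bytes(void)
{
    uint32_t save = MAX_GPR_SAVE_BYTES + MAX_FPR_SAVE_BYTES;  /* 220 */
    return (save + 15u) & ~(uint32_t)15u;  /* keep SP quadword aligned */
}

/* SP value the handler would use, given the interrupted code's SP. */
static uint32_t handler_sp(uint32_t interrupted_sp)
{
    assert(interrupted_sp % 16u == 0);
    assert(interrupted_sp >= interrupt_skip_bytes());
    return interrupted_sp - interrupt_skip_bytes();
}
```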


Variable Size Stack Frames 


Sometimes it is useful to dynamically allocate storage in a function and have that storage auto- 
matically deallocated when the function returns. This is done in C with the library function 
alloca. The way alloca works is that it expands the function’s local stack area. 


Example 6.2 shows one way to do this on the PowerPC. Notice that GPR 31 is used to save the
original value of SP. This is to simplify accessing the caller’s argument passing area and the 
function’s data in the original local stack area. 


If you implement your own method of expanding the local stack area, you must remember: 


■ SP and the back chain value must be updated atomically.

■ Any memory beyond the register save area above SP is volatile because an interrupt
  could occur and the interrupt handler may use that memory.
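

The bookkeeping involved can be simulated in C on a small byte array standing in for stack memory. This is a model of the idea only, not the book's example: all names are hypothetical, head_bytes stands for the combined size of the link area and argument passing area (which must remain at the top of the frame), and the two C statements that store the back chain and update sp stand in for what a single store-with-update instruction would do atomically.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SIM_STACK_BYTES 1024u

typedef struct {
    uint8_t  mem[SIM_STACK_BYTES];  /* simulated stack memory */
    uint32_t sp;                    /* simulated GPR 1, an index into mem */
} SimStack;

static uint32_t load_word(const SimStack *s, uint32_t addr)
{
    uint32_t w;
    assert(addr + 4u <= SIM_STACK_BYTES);
    memcpy(&w, &s->mem[addr], sizeof w);
    return w;
}

static void store_word(SimStack *s, uint32_t addr, uint32_t w)
{
    assert(addr + 4u <= SIM_STACK_BYTES);
    memcpy(&s->mem[addr], &w, sizeof w);
}

/* Expand the current frame by at least nbytes and return the address of
   the newly available block, which lies just below the old local area. */
static uint32_t sim_alloca(SimStack *s, uint32_t nbytes, uint32_t head_bytes)
{
    uint32_t grow = (nbytes + 15u) & ~(uint32_t)15u;  /* keep SP aligned */
    uint32_t back_chain = load_word(s, s->sp);

    assert(s->sp % 16u == 0 && head_bytes >= 56u && grow <= s->sp);
    uint32_t new_sp = s->sp - grow;
    store_word(s, new_sp, back_chain);  /* on real hardware: one       */
    s->sp = new_sp;                     /* store-with-update does both */
    return new_sp + head_bytes;
}
```

The back chain at the new top of the stack still points at the caller's frame, so a handler that interrupts the function at any point sees a consistent chain.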


EXAMPLE 6.2. 


Differences for 64-Bit Processors


Operating systems and development tools for the 64-bit PowerPC implementations are currently
under development, so it’s difficult to be precise about what will happen with the subroutine
linkage conventions. However, since the GPRs will be 64 bits wide, the size of the GPR
save area will be doubled, and the store-double-word and load-double-word instructions will
be used to save and restore the nonvolatile registers.


When to Use the Conventions 


The register usage conventions and the subroutine linkage conventions are followed by the 
current PowerPC compilers, so they should clearly be used when you interface with compiled 
code. This includes: 

■ Compiled code calling your assembly code.

■ Your assembly code calling your compiled code.

■ Your assembly code calling library functions.


Once you are beyond those interfaces, the only other real concern is interrupt handlers and, in
the case of a preemptive multitasking operating system, the operating system preempting your
program. To avoid problems with these it is best to follow the convention of the SP as GPR 1
and the RTOC as GPR 2, because these registers may be expected to hold those meanings by
the interrupting code. In general, though, any interrupting software is expected to be invisible 
to the currently running software and should restore any machine state that it changes. So you 
are free to develop whatever standards are convenient for passing parameters amongst your own 
functions. This is one of the factors that can make well-written assembly code much faster than 
compiled code. 


The PowerPC Object Module Format 


The term object module refers to a program, or part of a program, which consists of code, data,
and control information. Apple’s PowerPC System Software volume of Inside Macintosh calls
PowerPC object modules fragments. Object modules are produced by compilers and assemblers.
They may contain references to code or data that are not contained within the object
module. Multiple object modules are combined together by the linker, or binder, which
resolves references between them. The PowerPC development tools and operating systems use a
paradigm for object modules that is quite different from the models used for DOS and the
680x0-based Macintosh systems. It is based on the AIX paradigm, where one of the major design
goals was efficient support for shared libraries. One result is that there is little difference
between an executable object module and any other object module, because of the capability of
load-time binding.


This is necessary for shared libraries because at the time a program loads, any calls to functions 
in shared libraries must be resolved. 


The PowerPC object module format is transparent to most programs written in higher-level
languages. The compilers will generate the proper code and data automatically. When writing
your own assembly language, however, you may have to add this code and data. Most of the 
time you won’t have to. 


Non-Shared Object Modules 


Most programs are statically bound internally, that is, references to functions and data within 
the program are resolved at link time. Most programs will only make calls to the operating 
system’s shared libraries, not actually contain shared object modules themselves. 


There is no difference in how data is accessed from dynamically or statically bound object 
modules. However, both when coding functions and when calling them there are slight differ- 
ences between statically linked and dynamically linked functions. Coding a function or calling 
a function as if it will be dynamically linked will always work, and this is what compilers usu- 
ally will do. When a function is coded and called as a dynamically linked function, but the 
function is actually statically bound, there is very little overhead. Calling a function that is 
dynamically linked is somewhat more expensive.


Control Sections


A PowerPC object module consists of one or more control sections, known as csects. Each csect
is a self-contained, relocatable piece of code or data. The csect is the atomic unit of relocation
in an object module; that is, csects may be relocated but are never merged or split.


Within an object module the csects are organized into several groups called sections. The text 
(or code) section contains initialized read-only csects, mostly csects containing executable code, 
read-only data, and debugger information. The data section contains initialized read-write csects, 
mostly csects containing the program’s static data, the TOC, and any function descriptors. The 
bss section contains uninitialized read-write data, mostly program data csects. 


SECTIONS AND SEGMENTS





When writing assembly language you define the csects that contain your code and data, but 
you must remember that they may be relocated relative to each other. So you must write your 
code to be position independent when making references between csects. The main mecha- 
nism for doing this is the TOC. 


The TOC 


The TOC (table of contents) is a special csect that contains the addresses of other csects. There 
is exactly one TOC for each object module, and when object modules are statically linked their 
TOCs are merged. When object modules are bound together dynamically they retain separate 
TOCs. Counter to the name, a module’s TOC contains addresses that csects in that module 
need to use, rather than a list of addresses within the module. Programs use the TOC for two 
primary purposes, gaining addressability to data and making function calls to dynamically bound 
functions. 


Because csects use the TOC to find the locations of other csects, and the TOC itself is a csect 
which could be located anywhere, there is a sort of “chicken and egg” problem. So, by conven- 
tion, GPR 2 (the RTOC) is dedicated to holding the address of the current TOC. All software 
should maintain this, although code in functions will rarely need to set the RTOC.


Entries in the TOC consist of a TOC entry name and one or more expressions, usually a csect
symbol or a label. If a TOC entry contains only a single expression, it may be combined into 
other TOC entries with the same name, if the expressions reference the same csect. This com- 
bining can potentially change the semantics of your code, for example, causing you to load the 
wrong address from the TOC, so it is best to choose unique names for all TOC entries. 
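The indirection described above can be modeled in C. This is only a sketch of the idea, not of how a linker lays out a real TOC. The names ctr, toc, rtoc, TCTR_OFFSET, and incctr_model are hypothetical; rtoc stands in for GPR 2, and a pointer-sized array slot stands in for a TOC entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical static data of the module: the counter itself. */
static int ctr = 0;

/* Hypothetical model of the module's TOC: a table of addresses that
 * code in this module needs. Slot 0 plays the role of one TOC entry. */
static void *toc[] = { &ctr };

/* Offset of that entry from the top of the TOC, in bytes. In
 * assembly, a TOC label stands for this offset. */
#define TCTR_OFFSET (0 * sizeof(void *))

/* Stand-in for GPR 2 (the RTOC): the address of the current TOC. */
static char *rtoc = (char *)toc;

/* Increment the counter, reaching it only through the TOC. */
int incctr_model(void)
{
    /* base register plus immediate offset: fetch the counter's address */
    void *entry = *(void **)(rtoc + TCTR_OFFSET);
    int *p = entry;

    assert(p != NULL);   /* the entry must have been filled in */
    *p = *p + 1;         /* load, increment, store */
    return *p;
}
```

The code never names ctr directly. It needs only the base address held in rtoc and a small constant offset, which is why the data can be relocated without changing the code.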


UNDERSTANDING LABELS 


The assembler essentially treats a label as a symbol for a csect symbol plus an offset. A 
csect symbol represents the 32-bit or 64-bit effective address of that csect, so a label 
also represents a 32-bit or 64-bit address, and since csect symbols are relocatable, labels 
are relocatable. 


The important exception to this rule is that a label in the TOC csect is always a symbol 
for that label’s offset from the top of the module’s TOC. When modules are statically 
linked their TOCs are merged, and the linker will update the values of any TOC 
labels.


Also, for relative branch instructions, the assembler will implicitly subtract the address 
of the branch instruction from the value of the symbol specifying the target address. 


Accessing Data Using the TOC 


You must use the TOC to find the addresses of any data your code uses from the data and bss
sections. Example 6.3 shows a simple example of addressing static data. The address of a counter
variable is loaded from the TOC using a TOC label. This example shows why TOC labels are
always relative to the top of a module’s TOC: the TOC is accessed using the RTOC
as a base register and an immediate offset. If TOC labels represented effective addresses, every
TOC access would have an immediate field like “LABEL - TOC[tc0]”. The symbol “TOC[tc0]”
represents the top of the current module’s TOC.





EXAMPLE 6.3. 


Function that increments a static counter variable. The TOC is used to find the 
address of the counter in the data section. 


        .csect  .incctr[PR]
        .globl  .incctr[PR]
        lwz     r3,Tctr(RTOC)   # get address of counter
                                # (Tctr is rel to TOC)
        lwz     r4,0(r3)        # load counter value
        addi    r4,r4,1         # increment counter
        stw     r4,0(r3)        # store counter value
        blr                     # return
        .csect  _ctrdata[RW]


Why can’t the label in a data csect, or even a csect symbol, be used directly? The main reason
is that there is no way to code a reference like this in the PowerPC architecture. Since all
immediate fields on arithmetic and load/store instructions are 16 bits, and a label represents an
effective address, the closest you could come to coding this would involve constant expressions
for breaking up the label’s value into 16-bit pieces, and these types of expressions would require
that the expression be preserved and evaluated by linkers and binders whenever relocation
occurred.
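To make the point about 16-bit immediate fields concrete, the arithmetic of splitting a 32-bit value into two 16-bit pieces can be sketched in C. The helper names are hypothetical. The sketch assumes that the high piece is applied shifted left by 16 bits (as the addis instruction does) and that the low piece is a sign-extended 16-bit immediate (as the addi instruction uses), which is why the high piece needs an adjustment whenever bit 15 of the low piece is set.

```c
#include <assert.h>
#include <stdint.h>

/* Low 16 bits of addr, read as a signed 16-bit immediate. */
int16_t low16(uint32_t addr)
{
    int32_t lo = (int32_t)(addr & 0xFFFFu);
    if (lo >= 0x8000)
        lo -= 0x10000;          /* sign-extend by hand */
    return (int16_t)lo;
}

/* High 16 bits, adjusted so that (high << 16) + low == addr even
 * when the low piece is negative. */
uint16_t high16_adjusted(uint32_t addr)
{
    return (uint16_t)((addr + 0x8000u) >> 16);
}

/* What a shifted add of the high piece followed by a sign-extended
 * add of the low piece would compute (modulo 2^32). */
uint32_t rebuild(uint16_t hi, int16_t lo)
{
    return ((uint32_t)hi << 16) + (uint32_t)(int32_t)lo;
}

/* Debug-build check: the two pieces must reassemble to addr. */
void check_split(uint32_t addr)
{
    assert(rebuild(high16_adjusted(addr), low16(addr)) == addr);
}
```

If addr were a relocatable label rather than a constant, both pieces would have to be recomputed every time the label moved. That is the burden on linkers and binders that the TOC approach avoids.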


Example 6.4 illustrates an important way to increase the efficiency of using the TOC. Rather 
than having TOC entries for every data label, you can load a register with a TOC entry and 
index from that address. To calculate the immediate fields you can use the difference between 
a label and the base. This approach is more efficient because only one load of a TOC entry is 
necessary. The .using pseudo op is used as a convenience. It tells the assembler to implicitly
calculate the immediate field based on the value given in the .using statement, and to use the
register given in the .using statement as the base register for the instruction. The .using
statement does not actually put the value into the register; it only tells the assembler to
assume that it is there. You must properly load the base register in your code.
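The same base-plus-offset idea can be sketched in C by grouping several variables into one structure. This is an analogy, not a translation of any example in this chapter: module_data, data_base, and bump are hypothetical names, data_base stands in for the single TOC entry, and the offsetof values play the role of the “label minus base” immediate fields.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical static data of a module, grouped so that one base
 * address gives access to every item. */
struct module_data {
    int  count;
    int  limit;
    char name[8];
};

static struct module_data data = { 0, 10, "ctr" };

/* Stand-in for the single TOC entry: the address of the data. */
static struct module_data *const data_base = &data;

/* Offsets from the base, known at build time. */
enum {
    COUNT_OFF = offsetof(struct module_data, count),
    LIMIT_OFF = offsetof(struct module_data, limit)
};

/* Increment count unless it has reached limit. One load of the base
 * address serves both accesses. Returns the resulting count. */
int bump(void)
{
    char *base  = (char *)data_base;          /* one load of the base */
    int  *count = (int *)(base + COUNT_OFF);  /* base + displacement  */
    int  *limit = (int *)(base + LIMIT_OFF);  /* base + displacement  */

    /* each displacement must fit in a signed 16-bit immediate */
    assert(COUNT_OFF < 32768 && LIMIT_OFF < 32768);

    if (*count < *limit)
        *count += 1;
    return *count;
}
```

A C compiler performs this bookkeeping for you when you write data.count. In assembly, the .using statement asks the assembler to do the equivalent subtraction.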


EXAMPLE 6.4.


FUNCTION ENTRY POINTS


The symbol used to refer to a function is known as the entry point. Function entry
points may be csect names or labels. If an entry point is a csect name, execution will
begin at the first instruction in the csect.

All entry point names must start with a period. Compilers will prefix function names
with the period automatically, but the assembler will not. So the C library function
printf would be referred to as .printf in assembly code, and an assembly function
.myfunc would be called as myfunc from a compiled language.


Statically Bound Functions 


Coding and calling statically bound functions on the PowerPC are very similar to performing
function calls on the x86 or 680x0 architectures. Basically all that you need to do is to
branch-and-link to the function’s entry point. If the function is contained in another module,
that module must declare the function global with the .globl pseudo op, and in the calling
module the entry point must be declared external with the .extern or the .globl pseudo op.
Example 6.5 illustrates this for statically bound functions calling between modules.





EXAMPLE 6.5.

Definition and calling of statically bound function .getxer from function .addxer.


####### file getxer.s #######
        .csect  .getxer[PR]
        .globl  .getxer[PR]
        mfxer   r3              # r3 <- xer
        blr                     # return

####### file addxer.s #######
        .extern .getxer
        .csect  .addxer[PR]
        .globl  .addxer[PR]
        mflr    r12             # r12 <- return address
        addc    r3,r3,r4        # add first two arguments
        bl      .getxer         # call func to get the xer value
        mtlr    r12             # lr <- return address
        blr                     # return


Shared Object Modules 


Dynamic binding mostly affects the definition and calling of functions. Access to data is still 
done via the TOC. Defining and calling dynamically bound functions basically involves add- 
ing an extra piece of data for each function defined, and an extra instruction at each function 
call.


Execution of Dynamically Bound Function Calls


When object modules are dynamically bound, they are not merged together as in static 
linking. One reason for this is that shared libraries could not share a single image of their text 
section if every image had to be customized. This implies that each object module which is 
dynamically bound will still have its own TOC, since the object modules are not merged. 


Functions in a sharable object are associated with a data structure called the function descriptor. 
Each function in a sharable object module must have a function descriptor. The descriptor 
contains the address of the function’s entry point, and a pointer to the TOC of the module. 


Calls to dynamically bound functions are made through glue or glink functions. These functions
find the function descriptor for the called function, set the RTOC, and branch to the
called function. When the function returns, it does not return to the glink code; instead it
returns directly to the original caller. Because the RTOC is set to the TOC of the shared object
module when the return branch is taken, the calling function must restore the value of the
RTOC.


Figure 6.3 shows program myprog which calls functions from the shared library libfoobar after 
libfoobar has been loaded and bound with myprog. References to functions in libfoobar have 
been transformed into references to glink code, which uses the myprog TOC to locate the 
function descriptors for foo and bar. The function descriptors give the TOC and entry point 
for the functions. The call to myfunc from main was resolved statically, so it is not made via a 
glink routine. Note that all references to libfoobar from myprog are located in the TOC, so the 
code in myprog is correct and complete whether or not libfoobar is loaded. 


FIGURE 6.3.
Shared library libfoobar has been dynamically bound with program myprog.


Following the thread of execution of .main’s call to .bar:


■ .main calls .glink_of_bar.

■ .glink_of_bar gets the address of .bar’s descriptor from myprog’s TOC.

■ .glink_of_bar sets the RTOC to point to libfoobar’s TOC using .bar’s descriptor.

■ .glink_of_bar branches to .bar using .bar’s descriptor to locate .bar.

■ .bar executes.

■ .bar returns directly to .main.

■ .main restores the RTOC to point to myprog’s TOC.
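The steps above can be modeled in C. This is a rough sketch under stated assumptions, not real glink code: struct func_desc, rtoc, saved_toc_slot, and the function names are hypothetical, the two TOCs are empty placeholders, and a C call can only approximate a branch that does not set the link register.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a function descriptor: the entry point plus
 * the TOC of the module that contains the function. */
struct func_desc {
    int  (*entry)(int);
    void  *toc;
};

/* Stand-in for GPR 2 (the RTOC). */
static void *rtoc;

/* Stand-in for the stack word in which the caller's TOC is kept. */
static void *saved_toc_slot;

/* Two modules, each with its own TOC (contents do not matter here). */
static int myprog_toc[1];
static int libfoobar_toc[1];

/* .bar lives in libfoobar and relies on the RTOC being libfoobar's. */
static int bar(int x)
{
    assert(rtoc == libfoobar_toc);
    return x + 1;
}

static struct func_desc bar_desc = { bar, libfoobar_toc };

/* Model of .glink_of_bar: find the descriptor, save the caller's TOC,
 * switch to the callee's TOC, and go to .bar. */
static int glink_of_bar(int x)
{
    struct func_desc *fd = &bar_desc;  /* found via myprog's TOC  */
    saved_toc_slot = rtoc;             /* remember myprog's TOC   */
    rtoc = fd->toc;                    /* RTOC <- libfoobar's TOC */
    return fd->entry(x);               /* go to .bar              */
}

/* Model of the call site in .main, including the TOC restore. */
int call_bar_from_main(int x)
{
    int result;

    rtoc = myprog_toc;         /* .main runs with myprog's TOC */
    result = glink_of_bar(x);  /* .main calls .glink_of_bar    */
    rtoc = saved_toc_slot;     /* .main restores its own TOC   */
    return result;
}
```

If the restore in call_bar_from_main were left out, .main would go on using libfoobar’s TOC for its own data accesses, which is the failure the last step in the list prevents.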


Example 6.6 shows the code for .glink_of_bar. Notice that the link register is not set in the
branch to .bar, as is usually done for a function call. Thus, when .bar performs its return branch,
it will return directly to .main.





EXAMPLE 6.6.

Code for .glink_of_bar. TbarFD is a label on the TOC entry containing the
address of .bar’s function descriptor.

        .csect  .glink_of_bar[GL]
        lwz     r12,TbarFD(RTOC)   # r12 <- bar func desc
        stw     RTOC,20(SP)        # store TOC in link area
        lwz     r0,0(r12)          # r0 <- .bar
        lwz     RTOC,4(r12)        # RTOC <- bar TOC
        mtctr   r0                 # ctr <- .bar
        bctr                       # branch to .bar


There is one glink routine for each dynamically bound function referenced, per module. This
is because the offset into the TOC of .bar’s function descriptor is hardcoded into the glink
routine by the use of the label TbarFD.


The glink routine destroys the values in GPR 0 and GPR 12. This is fine for functions that are 
strictly following the register usage conventions, as these registers are volatile and not used to 
pass function parameters. However, if you write your own dynamically bound assembly lan- 
guage functions with your own calling conventions, you will need to avoid these registers. 


A slight variation on the glink routine is shown in Example 6.7. The ptrgl function takes a
function pointer in GPR 11 rather than using an entry in the TOC. Recall that function pointers
are actually pointers to function descriptors. Note the use of GPR 11 as the environment pointer.
The ptrgl code will load whatever value is stored in the third word of the function descriptor
into GPR 11. This allows languages that make use of the environment pointer, such as Pascal,
to set its value. In languages that do not use the environment pointer, such as C, ptrgl will put
garbage into GPR 11, but it is considered a volatile register anyway.


EXAMPLE 6.7.





Coding Calls to Dynamically Bindable Functions 


Coding calls to dynamically bindable functions is really very easy. The only difference between
this and calling any other function is that a special no-op instruction is placed immediately
following the branch. The binder will replace this instruction with an instruction which will
restore the TOC of the current module if the called function is actually going to be called via
glink code. The binder will also replace the target of your calling branch with the location of
the glink code if it is needed.


You also do not need to write the glink code or create the function descriptor entry in the TOC. 
The linker will do both of these for you. Example 6.8 shows a function myfunc which calls a 
dynamically bindable function foo. 


EXAMPLE 6.8. 





Coding Dynamically Bindable Functions 


Coding a dynamically bindable function is also very simple. The function itself is coded just
like any other function, and the function entry point is exported. You do need to add the code
to generate the function descriptor, and you should export the descriptor’s symbol. Example 6.9
shows a dynamically bindable function xor64. Note that the use of .toc is required in the
module in order to use the TOC[tc0] symbol.


EXAMPLE 6.9.

Code for a dynamically bindable function.


        .globl  .xor64[PR]
        .globl  xor64[DS]
        .csect  .xor64[PR]
        xor     r3,r3,r5        # r3 <- r3 xor r5 (high words)
        xor     r4,r4,r6        # r4 <- r4 xor r6 (low words)
        blr                     # return
        .toc                    # .toc csect req'd for TOC[tc0]
        .csect  xor64[DS]       # function descriptor
        .long   .xor64[PR]      # entry point
        .long   TOC[tc0]        # module's TOC


Linking For Dynamic Binding 

In order for the linker to distinguish between symbols that will be resolved via dynamic bind- 
ing and those that are actually unresolved, when you link you should use an import list. The 
import list contains a list of symbols that the linker will assume are present at run time. The 
import list tells the linker to generate the glink code and to create TOC entries for function 
descriptors. The import list also tells the linker which object file contains the symbols, so that 
a list of object files that are required by the program can be generated. This list is used to load 
all the proper object files when the program is run. 


Creating A Shared Object File 


You can create an object module that will act like an import list when linked with a program. 
To do this you use an export list, which is the same as an import list, except that you use it 
when linking the object files to be dynamically loaded. 


Object File Formats 


It is the job of the operating system and development tools to read and write object modules to 
files. So most of the time you will never have to worry about the actual file format used to store 
an object module. 


Historically the PowerPC object modules have been stored in XCOFF (extended common object
file format) files. The use of XCOFF files comes from AIX, which uses this format exclusively.
XCOFF is an extension to the standard UNIX object file format COFF, which was formalized
in System V, although it was based on the BSD a.out format.


Apple System 7 supports XCOFF files, as well as a newer format, PEF (Preferred Executable
Format), which is better suited to the Macintosh environment. Although XCOFF files are
supported, any programs which depend on UNIX-specific shared libraries or system calls will
not execute correctly under System 7. Programs written for the Macintosh should use the PEF
format.


Macintosh Operating System Considerations 


The Macintosh Operating System on PowerPC-based Macintosh computers is designed to be
backward compatible with software written for 680x0-based Macintosh computers. In fact, many
parts of the Macintosh OS are still 680x0 code. The basis for this capability is a 68LC040
emulator and a part of the operating system known as the Mixed Mode Manager, which switches
between PowerPC (or native) mode and the 68LC040 emulator. For the most part this is
transparent, but it does complicate programming in some cases, and because 680x0 code must run
unchanged, the burden falls to the PowerPC code.


When performing a cross-mode call, the Mixed Mode Manager needs to know which instruction
set architecture the function uses so that it can switch to the correct mode. There are two cases
where this problem arises most often: calls to operating system functions, and the passing of
function pointers to functions of unknown architecture, especially system functions. Note that
there are other scenarios involving cross-mode function calls that may or may not be handled
transparently by the Mixed Mode Manager; you should consult Inside Macintosh for more details.


Calls to the operating system functions are handled by inserting glue code when the routine is
dynamically bound to the calling application. This glue code can perform whatever steps are
necessary to switch to the proper mode for the operating system function. For calls from 680x0
code, this is handled by the trap dispatch mechanism in the 68LC040 emulator.


The problem of passing function pointers is handled by using universal procedure pointers in 
the place of function pointers. A universal procedure pointer (UPP) is defined as either a pointer 
to 680x0 code, or a pointer to a routine descriptor. The routine descriptor is a structure which 
contains information about the function, including the calling conventions, instruction set 
architecture, and the actual function pointer (which is really a pointer to a function descriptor, 
in the case of a PowerPC function). The first element of a routine descriptor is a 680x0 in- 
struction which causes a trap into the Mixed Mode Manager. You should use a UPP rather 
than a normal function pointer in cases where you are providing a function pointer to be called 
by external code whose architecture is not certain. In addition, new versions of Macintosh sys- 
tem functions that expect UPPs should be used. 


An additional scenario exists which can commonly occur on the Macintosh: When a code re- 
source is invoked, often this is performed by simply branching to the beginning of the resource. 
To allow the Mixed Mode Manager to intervene, you should place a routine descriptor for 
your function at the beginning of the resource. 


Performance Tuning and Optimization


The earlier chapters in this book covered the nuts and bolts of the PowerPC architecture and
instruction set. You might be wondering about the story behind some of the features and fa- 
cilities of the machine, why they were provided, and when you have several options to imple- 
ment a particular function in your program, which is the best one to choose. This chapter covers 
some of these issues, and shows you how to best express your program in the language of 
PowerPC instructions. 


Speeding Up an Application Written in a
High-Level Language


Most applications are far too large to even talk seriously about wholesale rewriting to address 
performance problems. In most cases, performance can be improved significantly through rela- 
tively simple incremental improvements. 


When setting out to improve performance of a program, it is important to follow a strategy.
It’s all too easy to spend hours of programming effort chasing too little performance gain;
performance tuning work expands easily to fill almost any available time. Before starting, it’s
worth investing some time setting goals and deciding to what lengths you are willing to go to
achieve these goals. Have a plan, set goals, and stay focused.


Using Development Tools 


You need a suite of development tools that should include the following: 


■ A high-level language compiler. You will want to do most coding in a high-level
language, because it’s usually quicker and easier to develop that way.


■ A PowerPC assembler. You will want an assembler to do machine language coding.
Some compilers provide an inlining feature that enables you to code assembly language
instructions within a high-level language procedure. This capability can be
handy, but it can become awkward for larger coding projects.


■ A debugger. Unless you write perfect code the first time, every time, you probably
will want a debugging tool that enables you to step through assembly code and
examine or modify the machine state.


■ A profiler. This is one of the most important additions to the performance-minded
code developer’s toolbox. A profiler enables you to instrument an application to
discover critical information, such as how often functions and procedures are called
and which sections of your code are using the most CPU time.


Ordinarily, most or all of these tools are included with an operating system or as part of a
software development system. If you’re using AIX for the PowerPC, for example, the IBM XL
compilers, an assembler, a linker, a profiler, and several debuggers are available. On the Power
Macintosh, Metrowerks’ CodeWarrior system provides most of this, though it doesn’t include
an assembler.


Another very important tool is a means of measuring and monitoring actual performance. 
Depending on your development environment, this might be more of a problem than it first 
appears. The next section covers some of the problems and pitfalls of performance measure- 
ment, and what you can do about them. 





DEVELOPING A PERFORMANCE STRATEGY


■ Tools are an important component of any strategy, both for identifying optimization
opportunities and pursuing them. If your program is written in a high-level
language, use the highest quality, industrial-strength optimizing compiler
you can get. Program profiling can be crucial to understanding where your
program spends its time (you may be surprised). Profiling is the best way to
ensure that you spend your time wisely and get the most performance benefit for
your work. Use profiling to identify key routines or areas in the code where
speedup is most critical.


■ Examine the algorithms used in these critical areas. Make sure that the problem
is, in fact, not a programming error or poor algorithm choice. You might be able
to make a bubble-sort slightly faster with performance tuning, but a switch to a
better-behaved algorithm like quicksort is almost always a better tactic. If your
program spends a lot of time searching, for example, consider techniques like
hashing, balanced trees, or index inversion before recoding.


■ Isolate critical code in a module by itself and try playing with compilation
options. Some code responds better to some optimization strategies than others,
and most compilers enable you to selectively enable and disable particular
optimizations.


■ Sometimes, the program appears to be spending huge chunks of its time in
library or system routines. This can be discouraging, because often you cannot
exercise any control over these routines. At the same time, library code might be
more general in scope than what is called for in your application. You may be
able to replace some library or system calls with routines of your own that
perform much more efficiently. An example of this appears later in this chapter
in the “Putting It All Together” section.


■ When you recode a routine in assembler, make sure that you are building an
exact replacement. Also, carefully document your work, being sure to include
the code it replaces as the specification.


Measuring Performance


In your quest to improve performance, you need a way of checking your progress. There are 
many ways to measure performance with varying degrees of precision, accuracy, and cost. If 
you have access to a laboratory reference system with direct instrumentation of the hardware 
for performance measurement and diagnostics, you are fortunate indeed. This kind of system 
is truly the hot setup, and can provide invaluable insight into what really is happening inside 
the box. The bad news is that such systems are terrifically expensive, and few developers can 
justify the cost. Similar information sometimes can be obtained by using a software timing 
simulator, which can be less expensive than instrumented hardware, but this still is not neces- 
sarily cheap. Also, because software simulation is so much slower than real hardware, the size 
and scope of the workloads you can measure are limited. 


Fortunately, such precision isn’t usually necessary, and it’s possible to do performance work
with tools you probably already have. Your operating system may have facilities for measuring
program execution. The UNIX time command, for example, reports the system and user time
used during an execution of a program. If an OS measurement tool isn’t available, you might
be able to build your own using the processor’s own time base registers. However, it’s not
always necessary to resort to complicated hardware or software performance measurement tools.
In many cases, you are interested in performance because your application is making you wait,
and your wristwatch may provide all the precision you need.


Regardless of the technique you use to measure performance, be careful to control the condi- 
tions under which you conduct your experiments. It can be very difficult to get reliable mea- 
surements if your machine depends on a network file server or receives a fax in the middle of 
your timed runs. The best measurements are consistently repeatable timings that don’t vary 
more than a few percent from run to run. 
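If you have nothing better at hand, a small harness in C can give you repeatable timings. The following is a minimal sketch, not a recommended tool: workload, time_runs, and spread are hypothetical names, the loop is only a placeholder for the code you are tuning, and the standard clock() function reports processor time rather than wall-clock time.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical workload standing in for the code being tuned. */
static unsigned long workload(unsigned long n)
{
    volatile unsigned long sum = 0;  /* volatile keeps the loop alive */
    unsigned long i;

    for (i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* Time `runs` executions of the workload with clock() and report the
 * fastest and slowest run, in seconds of processor time. */
void time_runs(int runs, unsigned long n, double *best, double *worst)
{
    int r;

    assert(runs > 0);
    *best = *worst = 0.0;
    for (r = 0; r < runs; r++) {
        clock_t start = clock();
        (void)workload(n);
        clock_t stop = clock();
        double secs = (double)(stop - start) / CLOCKS_PER_SEC;

        if (r == 0 || secs < *best)
            *best = secs;
        if (r == 0 || secs > *worst)
            *worst = secs;
    }
}

/* Relative spread between the slowest and fastest run. A value of
 * 0.05 means the runs differ by 5 percent. */
double spread(double best, double worst)
{
    return best > 0.0 ? (worst - best) / best : 0.0;
}
```

Run the workload long enough that each timing is well above the resolution of clock(), and distrust any set of runs whose spread is more than a few percent.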


Examining PowerPC Coding Strategy and 
Instruction Selection 


Particular features of any architecture sometimes favor one coding style or approach to a pro- 
gramming problem over another. A list of the available registers and machine instructions doesn’t 
necessarily communicate how the designers intended the machine to be used. Machine lan- 
guage programmers are well aware that different code sequences that compute the same result 
can yield quite different measured performances. There may be many ways to solve a particu- 
lar problem, but often one is better than another, and that better way was provided by the 
machine designers expressly for that purpose. The PowerPC instruction set, for example, pro- 
vides several types of conditional branch instructions, and several types of load and store ad- 
dressing. In this chapter, you will learn the rationale for some of these features and how they 
can be used to extract maximum performance from your PowerPC programs. 


In some cases, these features improve performance by reducing the number of instructions 
required for a computation. Some optimizations, however, improve performance not by re- 
ducing the number of instructions per se, but by exploiting particular ways in which PowerPC 
processors are implemented in hardware. All the currently available PowerPC implementations 
are pipelined superscalar machines, meaning that each is capable of working on the execution of 
several instructions in parallel (see the following sidebar). All implementations can issue and 
complete at least three instructions per clock cycle. This execution capability is impressive, but 
limits—of the processor, the system, or your program—can prevent sustaining this peak ex- 
ecution rate for extended periods. Some of these limits are unavoidable; others, however, can 
be minimized through careful instruction selection, programming style, and code organization. 


PIPELINED AND SUPERSCALAR EXECUTION


Using Branches and Loops in PowerPC Code


Branch instructions provide the mechanism for altering the simple sequential flow of control 
in the execution of a program. The PowerPC architecture has several forms of conditional and 
unconditional branch instructions, and the way the hardware executes the different forms of 
these instructions has implications for which is best for a given programming situation. 


To understand this concept, look at what happens when a branch is encountered. In any 
pipelined processor, a conditional branch instruction must pass through several stages of the 
execution pipeline before the hardware can tell which instruction should follow the branch. By 
that time, some of the instructions at sequential addresses after the branch already may have 
proceeded through several stages of the execution pipeline. If the branch is taken (control is 
transferred off the sequential path), execution of the partially processed instructions after the 
branch must be abandoned and cleared out of the pipeline, and the fetch and execution of 
instructions at the branch target location must be initiated. Thus, the steady flow of instruc- 
tions through the pipeline is broken. The length of time from the point at which the branch 
takes effect until the pipeline flow resumes at the target address is called branch latency.


Because branches are so frequent, this could create a very serious performance problem. In most
programs, 15 percent or more of the instructions can be some form of branch. To avoid the 
loss of such a large number of processor cycles to branch latency, the PowerPC architecture has 
several important features to streamline the processing of branch instructions. 


Branch Resolution and Prediction 


By looking ahead in the instruction stream, it is possible for the processor to identify branch 
instructions early, and begin processing them several cycles before they would be executed in a 
strictly sequential fashion. This look-ahead hardware makes it possible to eliminate completely 
the latency of many branch instructions. This capability depends on being able to correctly 
determine whether a given branch instruction causes a transfer of control, and it must be possible
to make this determination before completely executing all the instructions preceding the
branch.


For unconditional branches, this is easy, because control is always transferred. The fetcher can 
discard safely the sequential path following the branch, and begin prefetching at the branch 
target as soon as the look-ahead hardware spots the unconditional branch. Things are not so 
simple for conditional branch instructions, because any instruction preceding the branch could 
affect whether it is taken. Only after the results of operations the branch depends upon are
available can the branch be treated as unconditional. Until then, the processor can wait for
the necessary information or predict which way the branch will go (possibly fixing things up if 
the guess was wrong). 


All current PowerPC implementations prefetch the instruction stream, execute unconditional 
branches early, and use a prediction mechanism to remove pipeline bubbles after conditional 
branches. In addition, the PowerPC architecture provides several mechanisms that enable the 
programmer to help the processor in the branch resolution process and improve program per- 
formance. 


Using Branch Unit Registers 


The PowerPC register set includes a number of special purpose registers designed to hold branch 
targets, loop counters, and branch conditions. These special registers enable some hardware 
speedups by simplifying tasks like determining which instructions currently in execution can 
affect the outcome of a conditional branch. These branch registers were described in Chapter 
4; in this chapter, you will learn how to use them most effectively. 


Using the Condition Register (CR) 


The PowerPC condition register provides a powerful means of controlling conditional branch- 
ing. It not only allows multiple condition codes, but also logical operations among condition 
code values. Descriptions of this facility sometimes make it seem more complicated than it
really is, so let’s get it straight.


The PowerPC condition register (CR) is a single register, 32 bits wide. A conditional branch
instruction can be coded to test the binary value of any one of the 32 bits, and branch if it is
one or if it is zero. However, the PowerPC comparison instructions treat the condition register
as if it were divided into eight four-bit condition codes, or CR fields. A comparison instruction
writes all four bits of one of these CR fields. The first three bits indicate less than, greater
than, and equal to relations between the values being compared; the fourth bit is a copy of the
summary overflow (SO) bit of the XER.
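To make the field layout concrete, the following C sketch models how a signed comparison fills
one four-bit CR field, and how a field number and a bit within the field map onto one of the 32
CR bit numbers. The function and constant names are invented for illustration; only the bit
order within a field (LT, GT, EQ, SO, with bit 0 the leftmost bit of the register) is taken from
the architecture.

```c
#include <stdint.h>

/* Bits of one 4-bit CR field, most significant first: LT, GT, EQ, SO. */
enum { CR_LT = 8, CR_GT = 4, CR_EQ = 2, CR_SO = 1 };

/* Model of a signed compare: returns the 4-bit value written to a CR field.
   xer_so is the current summary-overflow bit, which is copied into the field. */
static unsigned cr_compare(int32_t a, int32_t b, unsigned xer_so)
{
    unsigned field = (a < b) ? CR_LT : (a > b) ? CR_GT : CR_EQ;
    return field | (xer_so ? CR_SO : 0);
}

/* CR bit number (0..31, bit 0 = leftmost) of bit `bit` (0 = LT .. 3 = SO)
   within field `crf` (0..7). */
static unsigned cr_bit_index(unsigned crf, unsigned bit)
{
    return 4 * crf + bit;
}
```

For instance, the EQ bit of field cr7 is CR bit 30, which is the bit number an assembler
supplies when you write a branch such as beq cr7,target.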


Don’t worry, you don’t have to keep CR bit numbers straight—PowerPC assemblers provide 
extended mnemonics to make CR field selection for comparisons and the bit testing for branch 
instructions easy and readable. Example 7.1 shows PowerPC instructions to (1) add the values 
of registers r4 and r5, (2) compare the sum to zero, and (3) branch to label1 if the sum is equal
to zero.





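EXAMPLE 7.1.

        add     r3,r4,r5        # (1) sum of r4 and r5 (destination r3 chosen for illustration)
        cmpwi   cr0,r3,0        # (2) compare the sum to zero
        beq     cr0,label1      # (3) branch to label1 if the sum is zero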


It turns out that “comparison with zero” is a frequently needed operation; it also turns out to 
be fairly easy to implement in hardware. Some PowerPC instructions provide the ability to do 
a comparison with zero “for free,” eliminating the need for an explicit comparison instruction. 
This capability is called the record feature, and is available on many PowerPC integer and
floating point instructions. For integer instructions, the record feature causes condition field
zero to be set with the four-bit condition code generated by comparing the instruction’s result
to zero. (For floating point instructions, the record feature instead copies the exception
summary bits FX, FEX, VX, and OX of the FPSCR into condition field one.) The record feature is
specified by appending a dot (.) to the instruction opcode. Example 7.2 shows how the record
feature on the PowerPC add instruction can be used to eliminate the comparison instruction from
Example 7.1:









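EXAMPLE 7.2.
Implicit comparison using the record feature.

        add.    r3,r4,r5        # sum of r4 and r5; CR0 records the sum compared with zero
        beq     cr0,label1      # branch to label1 if the sum is zero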


The PowerPC instruction set also provides a set of logical instructions that enables you to com- 
bine multiple conditions and use a single branch instruction. The code sequence in Example 
7.3 branches to an error handler if the value in register r3 is outside the range {1,2,3,...,25}. 
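As a rough C model of the same idea (not the assembly listing itself), the sketch below keeps
the two comparisons separate, as two compare instructions would record them in two different
CR fields, then ORs the two condition bits and makes a single decision, much as a CR logical
instruction such as cror followed by one conditional branch would. The function names are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Two independent comparisons, each standing for one bit of a CR field. */
static bool below_low(int32_t r3)  { return r3 < 1;  }   /* LT bit of one field */
static bool above_high(int32_t r3) { return r3 > 25; }   /* GT bit of another   */

/* OR the two condition bits together, then make one decision on the result. */
static bool out_of_range(int32_t r3)
{
    bool combined = below_low(r3) | above_high(r3);   /* bitwise OR, no extra branch */
    return combined;
}
```

A caller would branch to its error handler when out_of_range returns true, so the normal path
contains only one conditional branch.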


EXAMPLE 7.3.


PROGRAMMING WITH THE CONDITION REGISTER 





Using the Count Register (CTR) 


The PowerPC branch unit architecture provides a second conditional branch mechanism via 
the count register (CTR). The count register was designed specifically to control loop iteration
in a way that makes it easy for the branch look-ahead hardware to predict the outcome of a
CTR-dependent conditional branch.


Constructing a loop using the CTR can be as easy as moving the loop iteration count from a
GPR to the CTR, and closing the loop with a CTR-based branch instruction that is taken
whenever the decremented value in the count register is not 0. Example 7.4 illustrates
a simple loop structure using the count register.







EXAMPLE 7.4.
Loop using the count register.

        mtctr   r3              # CTR = iteration count (held in r3 for illustration)
loop1:
        ...                     # loop body
        bdnz    loop1           # decrement CTR; branch if CTR is not 0


This mechanism is less general than branching based on the condition register, but it is suffi- 
cient to express many simple loops. Because only conditional branch instructions (and the mtctr 
instruction) can alter the CTR, in simple loops it is easy for the hardware to determine well in 
advance of executing the instructions in the loop body whether the loop-closing branch in- 
struction will be taken. Because the CTR always is decremented, some extra care (and instruc- 
tions) might be needed if the loop iteration count might be 0. 
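The zero-count hazard is easy to see in a small C model. The sketch below assumes a 32-bit
count register and models the decrement-then-test behavior of a loop-closing bdnz; the helper
names are invented for illustration. Because the decrement happens before the test, a count of
0 wraps to 0xFFFFFFFF and the branch is taken, which is why the loop needs an explicit guard.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of bdnz with a 32-bit CTR: decrement first, then branch if nonzero. */
static bool bdnz(uint32_t *ctr)
{
    *ctr -= 1;            /* unsigned arithmetic wraps: 0 becomes 0xFFFFFFFF */
    return *ctr != 0;
}

/* Sum n array elements with a CTR-style loop. The n == 0 guard is the
   "extra care": without it the body would run and the counter would wrap. */
static int32_t sum_ctr_loop(const int32_t *v, uint32_t n)
{
    int32_t total = 0;
    uint32_t ctr = n;     /* plays the role of mtctr */
    if (n == 0)
        return 0;
    do {
        total += *v++;    /* loop body */
    } while (bdnz(&ctr)); /* loop-closing branch */
    return total;
}
```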


Naturally, it is possible to terminate the loop from within the body with other condition reg- 
ister-based conditional branches. It almost always is a bad idea to modify the CTR within the 
loop (with a move to CTR instruction), because this interferes with the hardware’s capability 
to resolve the branch in advance, possibly nullifying the performance advantage of using the 
CTR. The instruction set also enables branches to depend on both the count register and the 
condition register, making it possible to combine multiple-loop exit conditions in a single branch 
instruction. However, reading the condition register could delay resolution of the branch com- 
pared to a simple count register test. 


Some programming situations, such as jump tables, call for computing a branch target address 
in a general register. The PowerPC instruction set doesn’t permit branching directly to an address 
contained in a general register; the address must first be moved to the count register or link 
register (with the mtlr or mtctr variants of the mtspr instruction). It’s a good practice to reserve
the use of the link register (and the bclr instruction) for subroutine linkage (that is, when the
branch is a call or return), and the count register (with bcctr) for computed go-to’s and jump
tables. Not only is it good style, but it can improve performance; the PowerPC 620 processor
has a hardware link register that can speed up return from function calls if this convention is
followed.


Looking At PowerPC Branch Prediction 


Each PowerPC processor implementation has a mechanism for predicting the execution path 
beyond conditional branch instructions. As long as the predictions prove correct, there can be 
an uninterrupted flow of instructions through the execution pipeline, and no branch latency 
performance penalty.


The PowerPC 601 and 603 processors both implement a similar branch-prediction scheme.
The mechanism used by both these chips is static, because the processor does not “remember” 
how a particular branch instruction behaved earlier in the execution of the program. The 
PowerPC 604 and 620 processors both implement dynamic branch prediction, and have on- 
chip memory dedicated to keeping track of how a particular branch instruction behaved in the 
past. This information is used by the hardware to predict whether the branch will be taken if it 
is encountered again during the execution. 


Static Prediction in the PowerPC 601 and 603 Processors 


The static branch prediction scheme implemented in these PowerPC processors is based on 
the type of branch instruction and the direction of the branch. When a branch instruction is 
executed, the processor issues a fetch request for a new block of instructions. If the branch is 
conditional, and it cannot be immediately determined whether the branch will be taken (be- 
cause the condition depends on data that is still being computed), the processor guesses which 
way the branch will go, and begins fetching along that guessed path. 


The possibility of an incorrect guess means that the processor must be prepared to abort the 
execution of any speculatively executed instructions, and resume execution along the correct 
path. The inability to commit the results of the speculative execution limits the number of 
instructions that can be executed along the guessed path, and 601 and 603 processors both are 
limited to execution past a single predicted branch. 


The branch-prediction scheme used for unresolved conditional branches by the 601 and 603 
processors is simple. If the conditional branch is to the address contained in the link register 
or count register (a bcctr or bclr instruction that tests a bit in the condition register), the
branch is predicted taken. Other conditional branches are predicted based on the direction of
the branch; the branch is predicted taken if the branch target address is less than the address 
of the branch instruction, and it is predicted not taken if the branch target address is greater 
than the branch instruction’s address (the hardware simply tests the sign bit of the address 
offset in the branch instruction). This convention is illustrated in Example 7.5; backward 
branches are predicted taken, and forward branches are predicted not taken. 


EXAMPLE 7.5. 





This scheme corresponds to the way branches are most often used. A branch to a negative offset
is commonly the closing instruction of a loop, and therefore usually would be taken. A branch
with a positive offset cannot be predicted as accurately, but in many cases these branches are
used to exit from loops or to escape to special-case and exception-handling sections of the code, 
and are less likely to be taken. 


These rules are based on broad generalizations, and really just exploit particular coding styles 
and compiler conventions. It is not difficult to imagine programming situations where these 
static prediction rules would be incorrect. If you know that the branch is most likely to behave
opposite to this prediction convention, it can be reversed by setting the Y bit in the BO field of
the conditional branch instruction. Rather than worry about whether to set the Y bit, you can
indicate explicitly which way the hardware should predict the branch. Appending + to the branch
mnemonic requests that the branch be predicted taken, and appending - requests that it be
predicted not taken. The assembler then makes sure that the Y bit is set appropriately, based on
the sign of the branch offset. Example 7.6 shows how to force a forward branch, which is normally
predicted not taken, to be predicted taken instead.
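The interaction between the branch direction, the Y bit, and the + and - suffixes can be
summarized in a few lines of C. This is only a model of the rule for branches with a relative
displacement; the function names are hypothetical, and branches through the link or count
register are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Default static prediction for a relative conditional branch:
   a negative displacement (backward branch) is predicted taken. */
static bool default_predict_taken(int32_t displacement)
{
    return displacement < 0;
}

/* Prediction after the Y bit is applied: Y = 1 reverses the default. */
static bool predict_taken(int32_t displacement, bool y_bit)
{
    return default_predict_taken(displacement) != y_bit;
}

/* What an assembler has to do for a "+" (want_taken = true) or "-"
   (want_taken = false) suffix: set Y only when the hint disagrees
   with the default for this displacement. */
static bool y_bit_for_hint(int32_t displacement, bool want_taken)
{
    return default_predict_taken(displacement) != want_taken;
}
```

Note that the same suffix can produce either value of the Y bit: a + on a forward branch sets
Y, while a + on a backward branch leaves it clear because the default already predicts taken.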


EXAMPLE 7.6.








Dynamic Branch Prediction in the PowerPC 604 and 620 Processors 


Instead of fixed prediction rules, the PowerPC 604 and 620 processors both have hardware for 
dynamic branch prediction. The hardware maintains a branch history table, indexed by bits of 
the address of the branch instruction, that tracks how that branch has behaved earlier. When a 
branch instruction is identified by the look-ahead hardware, the branch history table is con- 
sulted. Because most branch instructions will behave in the future much as they have behaved 
previously, the history table can yield much more accurate predictions than the static rules. 
This is not to say that the prediction algorithms always are correct. However, unlike static pre- 
diction, there’s no need to worry about coding the Y bit properly. 
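One common textbook way to build such a table uses a two-bit saturating counter per entry. The
C sketch below illustrates that general scheme only; the table size, the index function, and
the counter encoding are assumptions made for the example and are not a description of the
actual 604 or 620 hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 512u            /* assumed size; must be a power of two */

typedef struct {
    uint8_t counter[BHT_ENTRIES];   /* 0,1 = predict not taken; 2,3 = predict taken */
} bht_t;

/* Index by low-order bits of the word-aligned branch address. */
static unsigned bht_index(uint32_t branch_addr)
{
    return (branch_addr >> 2) & (BHT_ENTRIES - 1u);
}

static bool bht_predict(const bht_t *t, uint32_t branch_addr)
{
    return t->counter[bht_index(branch_addr)] >= 2;
}

/* Record the actual outcome, saturating at 0 and 3. */
static void bht_update(bht_t *t, uint32_t branch_addr, bool taken)
{
    uint8_t *c = &t->counter[bht_index(branch_addr)];
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
}
```

With this encoding, a loop-closing branch that has been taken many times is still predicted
taken after the single not-taken outcome at loop exit. Because only some address bits select
the entry, two different branches can share one counter and disturb each other's history.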


Instruction Reordering 


One of the guiding principles in the first generation of RISC architectures was to keep the 
instruction set simple and uniform, permitting the greatest flexibility in reordering instructions
to avoid stalls and keep the pipeline flowing smoothly. In the first RISCs, all this reordering
was done by instruction-scheduling compilers or assemblers prior to running the program.


The simplicity of RISC instructions also makes it easier to build hardware that does this kind
of scheduling dynamically, as the program is executed. One advantage of this approach is that 
it provides some relief from the necessity of rescheduling code for different implementations 
of the same architecture that have different optimal scheduling policies. All the current PowerPC 
implementations provide some capability to execute instructions out of program order. This 
feature reduces, to some extent, the degree to which the hardware depends on static instruc- 
tion scheduling to achieve good performance, and makes it possible to write programs that can 
perform well across a large family of PowerPC implementations. 


Therefore, although there are notable differences among PowerPC implementations in their 
capability to dispatch and reorder instructions, it is not necessary to understand these differ- 
ences in painstaking detail in order to write PowerPC code that achieves good performance. In 
fact, excellent performance usually can be achieved by following a few simple rules: 


1. Separate compares from dependent branches.


2. Separate uses from loads.


3. Interleave operations of different functional units (integer, branch, load-store, floating
point).


The first rule is to help reduce branch misprediction penalty cycles. PowerPC processors look 
ahead in the instruction stream and predict the outcome of branches to keep the execution 
pipeline full. This capability is limited, however; even if the direction of a conditional branch 
is predicted correctly, the hardware can proceed only so far down the predicted path before it 
has to wait for confirmation of the prediction. If the branch outcome is not predicted correctly, 
the sooner the processor finds out, the sooner the misprediction recovery process can be started. 
Therefore, where possible, use a branch that doesn’t depend on the CR, but if this is unavoid- 
able, compute the CR value for a branch as early as possible. 


The second rule suggests that when you load a value from memory, try not to use that value in
an instruction immediately following the load; instead, place the load instruction earlier in the
sequence, or the instruction that uses the loaded value later.


Following the third rule helps the dispatcher maximize the number of instructions that can be 
executed in parallel. PowerPC processors can, to varying degrees, reorder instructions on the 
fly, but the dispatcher can choose from only a few instructions at any given time. Each PowerPC 
implementation uses different algorithms and functional unit instruction groupings, so if you 
are writing code for a specific processor target and want very precise details of the scheduling 
model, consult the instruction timing section of the User’s Guide for that processor. 


Programming with PowerPC Memory Instructions 


The PowerPC load and store instructions provide a powerful and flexible interface to processor
memory. The instructions support several addressing modes (the ways in which an address can be
formed). Different PowerPC instructions enable an address to be expressed as a constant, the
contents of a register, the sum of a register and a constant, or the sum of two registers. Both
register-plus-constant and indexed (sum of two registers) instructions also have update forms,
which write the computed address back into the base register so that it can step to the next
sequential memory unit.


Instructions are provided to support access modes of byte, halfword, word, and doubleword. 
Several block-mode memory instructions are defined to save or restore multiple general regis- 
ters and manipulate strings. 


PowerPC memory is byte-addressed. Unlike some other RISCs, all the standard PowerPC load 
and store instructions require no alignment of the effective address. However, depending on 
the processor, there can be a significant performance penalty for unaligned memory references.
Best performance is provided when the alignment of a reference matches its natural mode
(doublewords should be aligned on 8-byte boundaries, words on 4-byte boundaries, and 
halfwords on 2-byte boundaries). 


Two principal factors determine the precise performance penalty for an unaligned access. The 
worst case is when the target data requires two cache accesses in order to complete the refer- 
ence. Unaligned references that do not span a doubleword boundary are tolerated gracefully 
by all PowerPC processors, usually requiring one additional cycle compared to an aligned ref- 
erence. Because misalignment can effectively cut load-store bandwidth in half, however, it is a 
good idea to align accesses whenever possible. 
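The two conditions discussed here, natural alignment and spanning a doubleword boundary, are
simple to test in code. The following C helpers are a minimal sketch with invented names; they
assume 32-bit addresses and access sizes of 1, 2, 4, or 8 bytes.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if an access of `size` bytes (1, 2, 4, or 8) at `addr` is naturally aligned. */
static bool naturally_aligned(uint32_t addr, uint32_t size)
{
    return (addr & (size - 1u)) == 0;
}

/* True if the access touches bytes on both sides of an 8-byte (doubleword) boundary. */
static bool crosses_doubleword(uint32_t addr, uint32_t size)
{
    return (addr >> 3) != ((addr + size - 1u) >> 3);
}
```

A word access at address 0x1002 is unaligned but stays inside one doubleword, whereas a word
access at 0x1006 is unaligned and also crosses into the next doubleword.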


POWERPC LOAD/STORE MULTIPLE AND STRING INSTRUCTIONS


Putting It All Together


By now, you probably have some ideas about writing good PowerPC code. This section crys- 
tallizes some of these ideas in an optimization problem that explores alternative formulations 
of a simple programming task. 


Suppose that while profiling your application, you notice that a huge amount of CPU time is 
spent in a library routine that copies strings (by the way, be prepared—discovering that some 
mundane task is chewing up loads of CPU time is not unusual). You discover that the library 
routine is a bulletproof general-purpose string copy routine that is coded inefficiently with special 
case checking and other features that your application doesn’t need. It’s possible that a custom 
recoding without all the bells and whistles could save a significant amount of CPU time. 


The code fragment shown in Example 7.7, bytecopy1, is a naive first cut of a simple assembly
language string copy routine. Input to the routine consists of a pointer to the first byte in the 
source string, a pointer to the first byte of the destination, and the number of bytes to copy. 


EXAMPLE 7.7. 





In this sequence, the first two instructions copy a byte from the source to the target, and the 
next two instructions increment the addresses to point at the next byte locations. The last three 
instructions decrement the iteration count, test it for zero, and branch back to the top of the 
loop. 
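Expressed in C, the steps just described might look like the following sketch. The function
name and the argument order follow the text; the requirement that the count be nonzero is an
assumption of this model, and a compiler will not necessarily translate it into the same seven
instructions.

```c
#include <stddef.h>

/* C model of the naive loop; n must be nonzero, as in any do-while loop. */
static void bytecopy1(const unsigned char *src, unsigned char *dst, size_t n)
{
    do {
        unsigned char b = *src;   /* load a byte from the source   */
        *dst = b;                 /* store it to the target         */
        src = src + 1;            /* advance the source pointer     */
        dst = dst + 1;            /* advance the target pointer     */
        n = n - 1;                /* decrement the iteration count  */
    } while (n != 0);             /* test for zero and branch back  */
}
```

Written this way, it is easy to see that three of the statements only maintain pointers and
the counter; those are the ones the later versions set out to remove.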


Using Indexed Load and Store 


Three of the seven instructions in the bytecopy1 loop are simply incrementing or decrementing 
index and counter variables. The PowerPC instruction set has indexed load and store instructions
that allow sharing of an index variable, and bytecopy2 shows how using the same index for the
load and store instructions enables you to eliminate one of the add immediate instructions. You
can eliminate the decrement instruction simply by using the index as the loop control variable,
as shown in Example 7.8.





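EXAMPLE 7.8.
Using indexed load and store instructions.

# Byte copy using indexed load and store
# r3 = pointer to source
# r4 = pointer to target
# r5 = nonzero number of bytes
bytecopy2:
        li      r13,0           # r13 = index, starting at the first byte
        subi    r5,r5,1         # r5 = index of the last byte, so that the early
                                #   compare can test the current index
loop:
        lbzx    r9,r13,r3       # load the byte at source + index
        cmpw    cr7,r13,r5      # is the current index below the last index?
        stbx    r9,r13,r4       # store the byte at target + index
        addi    r13,r13,1       # advance the index
        blt+    cr7,loop        # yes: copy another byte
        blr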


The loop body in Example 7.8 has been reduced from seven to five instructions; however, one 
cost of this approach is the requirement of an additional register (r13) to hold the index. The 
compare is scheduled early to avoid delays in resolving the conditional branch; however, to do 
this, it’s necessary to compare the byte count with the index value used in the current, not the 
next, loop iteration. When reordering instructions in this way, be sure to preserve the meaning 
of the program. For example, if the compare instruction is placed after, instead of before, the 
add immediate instruction that increments the index, the branch would need to be coded dif- 
ferently (bng instead of blt) to avoid exiting the loop one iteration too soon. 


Using Load and Store with Update 


Sometimes it’s possible to use an update form of the PowerPC load or store instructions to 
automatically increment an address. This technique is shown in Example 7.9, which has no 
explicit increment instructions, but instead uses the update feature to do this automatically. 
The index register has been eliminated by incrementing the source and target pointers directly. 
One drawback to this approach is that the addresses must be adjusted to allow for the increment
that occurs before the very first reference; hence the subtract-immediate instructions above
the loop body.


EXAMPLE 7.9.





To avoid explicit incrementation instructions in the body of the loop, the termination test is 
based directly on one of the addresses—the add instruction just above the loop label computes 
the address of the last byte to be copied, which serves as the operand of the compare instruc- 
tion following the load. You're back to seven total instructions, but three of them are executed 
only once—the loop body is only four instructions. 


Using the Count Register 


The count register and the CTR decrement and test capabilities of PowerPC branch instruc- 
tions can be used to eliminate many explicit compare-and-branch sequences used to control 
program loops. Using the CTR instead of the CR to control a loop is desirable, because in 
most cases, the hardware can resolve a CTR-based branch more quickly. Because the number
of loop iterations is known, the byte copy loop is an ideal candidate for control with the CTR, 
as shown in Example 7.10. 


EXAMPLE 7.10.


In some special cases where the memory block size and alignment obey certain rules, it’s possible
to code a more efficient solution by “unrolling the loop” and reading and writing more than one
byte at a time, but bytecopy4 probably is as good as it gets for a small and truly general
byte-granular copy routine for the PowerPC architecture.


In some programming situations, it is necessary to terminate loops under several different con- 
ditions, possibly depending on values computed within the loop body itself. It almost never is 
desirable to modify the CTR in the body of the loop, because this can defeat the hardware that 
provides the performance advantage of using CTR-controlled loops in the first place. The CTR 
still can be used advantageously, however, using CR-based branches for “early exit” conditions. 
Example 7.11, for instance, shows an enhancement of bytecopy4 that takes advantage of the C
programming language conventions for strings. In C, a string is a sequence of characters
terminated by a null (0 byte). The compare and conditional branch instructions inserted in the
loop cause exit from the loop if a null byte is reached in the source string before the specified
number of bytes are copied (thus, the value in register r5 now is used as the maximum number of
bytes to copy).


EXAMPLE 7.11.


PowerPC branch instructions can test the CTR, the CR, or both at the same time, and Example
7.12 shows the same loop implemented with a single branch combining tests of both
termination conditions.


Incidentally, the name bytecopy was chosen deliberately to avoid confusion with the similar 
ANSI C library function strncpy. Unlike strncpy, bytecopy will not pad the end of the target 
string with null bytes, and it doesn’t return a result. You could modify bytecopy to mimic this 
behavior, but that’s not the point. There may be situations where zeroing the trailing portion 
of the target buffer is overhead that you want to eliminate, while retaining the safety and secu- 
rity of a limit that prevents writing past the end of the target buffer. 
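The difference from strncpy is perhaps easiest to see in C. The sketch below models a bounded
copy with an early exit on a null byte. The function name is hypothetical, and two details are
assumptions of this model rather than facts about the assembly versions: the terminating null
is itself copied before the loop exits, and the limit must be nonzero.

```c
#include <stddef.h>

/* Copy at most max bytes from src to dst, stopping after a null byte has
   been copied. Returns nothing and never pads dst, unlike strncpy.
   Assumes max is nonzero, matching a count-register loop. */
static void bytecopy_bounded(const char *src, char *dst, size_t max)
{
    do {
        char b = *src++;
        *dst++ = b;
        if (b == '\0')        /* early exit: test on the byte just copied */
            break;
    } while (--max != 0);     /* normal exit: CTR-style countdown */
}
```

Copying the two-character string "hi" into an eight-byte buffer writes only three bytes and
leaves the remaining five untouched, where strncpy would have zeroed them.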


EXAMPLE 7.12. 





Dealing with Overflow and Carry 


In many high-level languages (C springs to mind), the concepts of overflow and carry don’t 
appear in the language. The tacit assumption is that the programmer is responsible for choos- 
ing an integer type that provides enough precision for the computation, and it’s up to the user 
to make sure that the programmer’s assumptions aren’t violated. 


Even though most programming languages and programmers never use the features, almost all 
microprocessor architectures provide registers and instructions to deal with overflow and carry 
in fixed-point arithmetic instructions. These features are essential to permit implementation 
of integer types that are wider than a machine’s fixed-point general registers. Techniques for 
implementing multiple-precision arithmetic on PowerPC processors are described in Appen- 
dix A, “Programming Examples.” However, the information and capabilities provided also can 
be useful occasionally in single-precision computing.


In some machines, all integer instructions that could cause overflow set a status register, and
the program can check the status if appropriate. Maintaining that status register, however, can 
restrict the capability of the processor to issue and complete instructions out of order. Because 
overflow and carry information is used so rarely, most PowerPC fixed-point instructions that 
can overflow have forms that will set the status register only if the programmer has explicitly 
requested so. 


This results in a potentially confusing array of options on PowerPC integer instructions. For
example, when should you use a plain add instruction, and when should you use addo, addc,
addco, adde, or addeo? All these instructions add 32-bit integers; the differences among them
are the way that carry and overflow conditions are handled: for example, which conditions
are recorded in the PowerPC fixed point exception register (XER) or the condition register (CR).


First, if you aren’t concerned about overflow or carry situations, just use a simple add. In fact, 
you should make sure not to use the other forms, because setting overflow bits in the CR or 
XER could needlessly delay the execution of subsequent dependent instructions. 


Variants of the add carrying instruction (addcx) set the CA bit in the XER. This bit can be
carried in to add extended instructions (addex) to compute multiple precision results.
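The carry chain can be modeled in C with two 32-bit halves. In this sketch the type and
function names are invented; the low-word addition plays the part of an add carrying
instruction, and the high-word addition plays the part of an add extended instruction that
consumes the carry.

```c
#include <stdint.h>

typedef struct { uint32_t hi, lo; } u64_pair;

/* Add two 64-bit values held as 32-bit halves. The low-word add produces a
   carry (the role of CA); the high-word add consumes it. */
static u64_pair add64(u64_pair a, u64_pair b)
{
    u64_pair r;
    r.lo = a.lo + b.lo;
    uint32_t ca = (r.lo < a.lo);      /* carry out of the low word */
    r.hi = a.hi + b.hi + ca;          /* carry in to the high word */
    return r;
}
```

The same pattern extends to wider values: every word after the first both consumes the
incoming carry and produces a new one.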


The overflow enable (OE) forms of these instructions (for example, addo, addco, and addeo)
cause the overflow (OV) and summary overflow (SO) bits of the XER to be set if the instruction
overflows, which you should use if you want to detect an overflow condition and do something
about it later. The SO bit of the XER is sticky, meaning that once it is set, it remains set
until you explicitly clear it, such as with an mcrxr instruction. This stickiness enables you to
delay checking for an overflow condition until after a lengthy computation sequence; you can
sum up a sequence of numbers in a loop, and then check for possible overflow once upon
exiting the loop.
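A C model of this check-once pattern is sketched below. The function name is hypothetical, and
the flag is an ordinary variable standing in for the SO bit; the sum wraps at 32 bits, and the
overflow test is the usual one for two's-complement addition.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sum an array with 32-bit wraparound, recording overflow in a sticky flag
   (the role of SO). Inside the loop the flag is only ever set, never cleared. */
static int32_t sum_sticky(const int32_t *v, size_t n, bool *so)
{
    uint32_t acc = 0;
    *so = false;
    for (size_t i = 0; i < n; i++) {
        uint32_t a = acc, b = (uint32_t)v[i];
        uint32_t s = a + b;                       /* wraps on overflow */
        /* Signed overflow: both operands have the same sign and the
           sign of the result differs from it. */
        if (((a ^ s) & (b ^ s)) >> 31)
            *so = true;
        acc = s;
    }
    return (int32_t)acc;   /* conversion assumes the usual two's-complement behavior */
}
```

The caller inspects the flag once, after the loop, and can discard the sum or repeat the
computation at higher precision if it is set.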


Using PowerPC Floating-Point Instructions 


One of the biggest advantages of moving to the PowerPC microprocessor from some of the 
“classic” personal computer microprocessors is a substantial increase in floating point perfor- 
mance. All PowerPC processors provide much higher floating-point processing bandwidth than 
predecessor architectures. 


PowerPC processors support two native floating point modes: 32-bit single precision and 64- 
bit double precision. Both modes are compatible with the IEEE 754/854 standards for 
floating-point arithmetic. In most applications, you will want to use double-precision format, 
which yields the best accuracy. 


The performance advantage for single precision is primarily on the 601 and 603 processors, 
where double-precision multiplies take a cycle longer than single-precision multiplies. The 604
and 620 processors have no execution time penalty for double-precision operations. Actual
latency varies by data value, but double-precision division generally takes a little longer on all 
PowerPC processors. All processors can benefit from the reduced memory space used by single- 
precision values—see the next section (“Looking At Memory System and Cache Considerations”) 
on memory performance to understand how. If you don’t need more than a few significant
digits, single-precision values can significantly speed up some applications.


PowerPC processors provide a family of multiply and add instructions for both single and double 
precision operands. These instructions permit very efficient implementation of some numeric 
and matrix operations. (See the matrix multiplication example in Appendix A.) 


The PowerPC architecture doesn’t have instructions for copying values from floating point 
registers to and from general registers; any communication between the register sets must go 
through memory. If you convert a floating-point value to an integer, you need to store it in 
memory and reload it into a GPR to operate on it with integer instructions. 


FLOATING POINT AND THE POWER MACINTOSH 





Looking At Memory System and Cache 
Considerations 


Virtually all modern high-performance microprocessors have on-chip cache memory. On-chip 
caches help prevent the processor from starving for memory bandwidth, especially in the face 
of increasing relative latency of off-chip communication and DRAM access times. In most cases, 
it’s not necessary for the programmer to worry or do anything about caches; the beauty of the 
idea is that caches almost always work well without any intervention or explicit control on the 
part of application programs. Even so, there are some things you can do to help “optimize” 
your program’s cache performance. 


A word of warning, however. Optimizing cache performance can be a very difficult and
patience-testing process. First, without hardware instrumentation (or at the very least,
simulation tools), it can be very difficult to measure accurately the cache behavior of your program.
In many cases, results aren’t even reliably reproducible from one execution to another on the
same machine, due to context switches and other execution-dependent behavior. Accidents of
recompiling, relinking modules in a different order, moving variables in memory, and other
seemingly innocent changes sometimes can result in bigger performance swings than your tuning
efforts. Moreover, because cache designs and organizations vary so much across the PowerPC
processor line (see Table 7.1), it’s hard to predict whether a technique that works for 601 chips
will yield similar gains on any of the other chips. Having said that, there are some rules of thumb
you can follow, and some promising research in cache optimization bears discussion.


Table 7.1. Cache Designs and Organizations in the PowerPC Processor Line.


PowerPC      Size (Instruction/                    Replacement    Line Size
Processor    Data)                Associativity    Policy         (bytes)
601          32K Unified          8                LRU            32/64
603          8K/8K                2/2              LRU            32
604          16K/16K              4/4              LRU            32
620          32K/32K              8/8              LRU            64


Even if you aren’t conversant with cache-design issues and terminology (see the following
sidebar), you still can follow a few simple rules that could measurably improve your program’s
bottom-line performance.





CACHE BASICS 


Reducing Cache Misses


Cache misses often are classified into three general groups: compulsory misses, capacity misses, 
and conflict misses. Each of these misses arises due to a different aspect of cache and program 
behavior. It’s possible to formulate strategies for reducing all three types of misses. Which strategy 
works best depends on the distribution of each type in your program. 


Compulsory Misses 


The first time any memory block is referenced, the miss is compulsory, because the miss would 
occur regardless of the size or organization of the cache. Compulsory misses are a direct func- 
tion of the number of memory blocks touched by a program during its execution. Therefore, 
the only way to reduce compulsory misses is to reduce the total number of touched blocks. 


Capacity Misses 


When the cache is not large enough to hold all the blocks needed during execution, some blocks 
are loaded and discarded to make room for other blocks. Reloading such a block is classified as 
a capacity miss. Reducing capacity misses means effectively shortening the span of time during 
which particular blocks are needed. One way to do this is to partition an application into dis- 
tinct stages, each of which has a smaller working set. A simple example is to use an algorithm 
that makes a single pass through a large memory array (in particular, larger than the data cache), 
rather than an algorithm that makes two or more passes. The first time through, the misses are 
compulsory; as work progresses through the array, blocks holding portions of the beginning of 
the array must be discarded to make room for later portions. Returning for the second pass, 
the beginning of the array must be reloaded, and these are capacity misses. 
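
As an illustration of the single-pass idea, the sketch below computes two results (a sum and a maximum) over an array. The names Stats, stats_two_pass, and stats_one_pass are hypothetical, and the code assumes count is at least 1. For an array larger than the data cache, the two-pass version touches every block twice, while the one-pass version touches each block once.

```c
#include <stddef.h>

typedef struct {
    double sum;
    double max;
} Stats;

/* Two passes: when the array is larger than the data cache, the second
   loop must reload the blocks that the first loop pushed out. */
Stats stats_two_pass(const double *data, size_t count)
{
    Stats s = { 0.0, data[0] };
    size_t i;

    for (i = 0; i < count; i++)
        s.sum += data[i];
    for (i = 1; i < count; i++)
        if (data[i] > s.max)
            s.max = data[i];
    return s;
}

/* One pass: each block is brought into the cache once and fully used. */
Stats stats_one_pass(const double *data, size_t count)
{
    Stats s = { 0.0, data[0] };
    size_t i;

    for (i = 0; i < count; i++) {
        s.sum += data[i];
        if (data[i] > s.max)
            s.max = data[i];
    }
    return s;
}
```

Both functions return the same values; only the memory access pattern differs.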


Conflict Misses


If a cache is fully associative, all misses are compulsory or capacity. If a cache is not fully asso- 
ciative, it is possible for a block to be discarded to make room for another block that maps to 
the same set. This is a conflict miss. Conflict misses are the trickiest, because you need to pay 
attention to addresses and layout of program sections. Fortunately, for most applications, any 
degree of associativity reduces conflict misses to a very small portion of the overall total. 


As pointed out earlier, the way to reduce compulsory misses is to reduce the total number of 
memory blocks that a program touches, which also indirectly reduces capacity misses. One 
strategy is to try to increase locality—to organize code and data so that instructions or data 
that are needed frequently are not grouped in blocks with things that rarely are needed. You 
could generate error and exception-handling code in one area, for example, and the mainline 
code in another area. This method reduces the likelihood of loading a block containing a se- 
quence of exception-handling instructions that always are branched around by the mainline. 


Reorganizing blocks of code based on execution frequency seems like an ideal job for a tool, 
and feedback directed program restructuring (FDPR) is available from IBM for the AIX envi- 
ronment. This tool uses trace profile feedback to reorganize the basic blocks of the executable; 
experiments show typical speedups of 20 percent. 


Even if you don’t have access to the tool, you can follow some simple coding practices to try to 
achieve similar results. 


Using Alignment and Cache Block Boundaries 


Aligning branch targets on cache-line boundaries not only can help cache performance, but
also can improve instruction parallelism. This is because the PowerPC processor dispatcher
depends on being able to fetch several instructions at a time. If the target address of a taken 
branch is near the end of a cache line, it increases the likelihood that some of the instructions 
in the block (those above the branch target) are fetched, but then immediately discarded. Then, 
the fetcher immediately must issue another request for the instructions below the branch tar- 
get (in the following line). Instructions that might have been issued immediately had they ar- 
rived with the first fetch are delayed until the second block can be loaded into the instruction 
dispatch buffer. This situation is especially noticeable in the PowerPC 601 processor, which 
has a unified cache. Because the instruction fetcher and the load-store unit share the cache port, 
and the load-store unit always gets priority, a pending load or store instruction could delay the 
fetch of the next block of instructions for at least one additional cycle. It’s a good idea, there- 
fore, to align frequently executed branch targets on 32-byte boundaries (for 601 processors, 
which can load up to eight instructions per cycle), 16-byte boundaries (for 604 and 620 pro- 
cessors, which can load up to four instructions per cycle), or 8-byte boundaries (for 603 pro- 
cessors, which can load up to two instructions per cycle), especially if the target is the top of a 
loop. If the loop fits in a single cache line, it’s a good idea to align the branch target at the top 
of the loop to a 32- or 64-byte boundary. 


For similar reasons, it can be a good idea to align data structures so that they don’t span
cache-line boundaries. If structures are not accessed sequentially, and the cache is not large
enough to hold all the data, it’s possible to reduce miss penalty cycles if the structures never
span a cache-line boundary. The benefits of this technique, however, must be traded off against
the cost of wasted storage if the size of the structure is not a factor or multiple of the line size.
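
The alignment arithmetic involved can be sketched in C as follows. The helper names are hypothetical, and the code assumes that the line size (or alignment) is a power of two and that size is at least 1.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if an object of `size` bytes starting at
   `address` crosses a cache-line boundary. Assumes size >= 1. */
int spans_line(uintptr_t address, size_t size, size_t line_size)
{
    return (address / line_size) != ((address + size - 1) / line_size);
}

/* Hypothetical helper: round `address` up to the next multiple of
   `alignment`, which must be a power of two (for example 8, 16, 32, or 64). */
uintptr_t align_up(uintptr_t address, size_t alignment)
{
    return (address + alignment - 1) & ~(uintptr_t)(alignment - 1);
}
```

For example, a 16-byte structure at address 24 spans two 32-byte lines, while the same structure at address 32 does not.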


Standard Symbol Definitions


The PowerPC assemblers do not natively support symbolic arguments representing registers
and register fields. For example, the assembler expects “7”, not “r7”, as a GPR argument to an
instruction. To make source code more readable, it’s beneficial to create symbolic constants
that can be used as instruction arguments. Example A.1 is the standard set of symbolic
constants used for writing the sample programs. This file uses the .set assembler pseudo-op to
define the symbolic constants, and it is then included in each assembly language source
file. For our examples we use the UNIX m4 macro preprocessor. An implementation of m4 is
available from the Free Software Foundation in source code form. An alternative is to change
the m4 include command to the C preprocessor’s include command. You can then run each
source file through the C preprocessor before assembly.


EXAMPLE A.1. 





Linking to C 
When interfacing with C and using nonvolatile FPRs and GPRs, you can save a substantial
amount of code by using the routines in Example A.2. These routines also illustrate how to use
a label as an entry point by prefixing the label with a period. The technique of jumping into a
table-like series of instructions is often useful in assembly language programming when you
may want to perform some action using a register or set of registers that is not known until
runtime.


EXAMPLE A.2.


The function framework shown in Example A.3 can be used as a model for functions that are
called from a compiled language. The full prolog, full epilog, and the function descriptor are
shown. As discussed in Chapter 6, some of the steps shown are optional.


EXAMPLE A.3. 
Program Timing


One key aspect of tuning for performance is being able to measure the performance of code.
Example A.4 shows aixtimer.h, a C header file that defines macros for timing. The timer
facilities are POSIX-specific, so this code should be portable to most operating systems with
little modification. One important point to note is that in a multitasking operating system there
is a difference between real (or wall clock) time and the time spent running any one program,
since the operating system can switch between several processes. This timer shows the wall clock
time, the system time (time spent running operating system code on behalf of the program),
and the user time (time spent actually running the program’s code). User time also includes
time spent running library code. One problem with this type of timer is that it is often not very
accurate, due to the overhead of maintaining times for all running programs. On AIX 3.2, this
timer has a resolution of ten milliseconds, so to get accurate results, functions must
have their run times extended, either by increasing the amount of data processed or by
repeatedly invoking the function. Example A.5 shows a C program that uses the macros from
aixtimer.h to time mxmul on an extended data set. The source code for mxmul is shown in
Example A.15.
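
The repeated-invocation approach can be sketched as follows. This is not the book's timer; it is a minimal illustration that assumes a POSIX system providing times() and sysconf(_SC_CLK_TCK). The names user_seconds_per_call, sample_work, and sample_counter are hypothetical.

```c
#include <sys/times.h>
#include <unistd.h>

/* Hypothetical workload used only to have something to time. */
volatile unsigned long sample_counter = 0;

void sample_work(void)
{
    unsigned long i;
    for (i = 0; i < 1000; i++)
        sample_counter += i;
}

/* Average user time per call, in seconds. The function under test is
   invoked `repetitions` times so that the total run time is long compared
   with the resolution of the timer. */
double user_seconds_per_call(void (*work)(void), long repetitions)
{
    struct tms start, stop;
    long ticks_per_second = sysconf(_SC_CLK_TCK);  /* queried, not hard-coded */
    long i;

    times(&start);
    for (i = 0; i < repetitions; i++)
        work();
    times(&stop);

    return (double)(stop.tms_utime - start.tms_utime)
           / (double)ticks_per_second / (double)repetitions;
}
```

If the result is zero, the total run time was below the timer resolution and the repetition count should be increased.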


Note that on the Macintosh you should be able to get higher resolution times by simply using 
the move-from-time-base instructions (move-from-real-time-clock for the 601), since the 
Macintosh Operating System does not perform preemptive multitasking. 


EXAMPLE A.4.


aixtimer.h: C header file for timing called assembly functions.


/* Example A.4 */
#include <stdio.h>
#ifndef _POSIX_SOURCE
#define _POSIX_SOURCE
#endif
#include <sys/times.h>

#define START_TIMER StartClk = times(&StartTMS);
#define STOP_TIMER  StopClk = times(&StopTMS);
#define PRINT_TIME  printTime();

static struct tms StartTMS, StopTMS;
static clock_t StartClk, StopClk;

/* Print elapsed wall, system, and user time in seconds. Dividing by 100.0
   assumes 100 clock ticks per second (the ten-millisecond resolution). */
static void printTime()
{
    clock_t s, u;

    s = StopTMS.tms_stime - StartTMS.tms_stime;
    u = StopTMS.tms_utime - StartTMS.tms_utime;
    printf("Time: wall = %.2f, sys = %.2f, user = %.2f (proc = %.2f)\n",
           (StopClk - StartClk) / 100.0, s / 100.0, u / 100.0, (u + s) / 100.0);
}


EXAMPLE A.5.


try_mxmul.c: C program that uses aixtimer.h to time assembly language function
mxmul.


/* Example A.5 */
#include <stdio.h>
#include <malloc.h>
#include "aixtimer.h"

extern void mxmul(int h, int n, int w, float A[][], float B[][], float R[][]);
/*
    R = A * B
    A is n x h
    B is w x n
    R is w x h
*/
#define HEIGHT 128
#define SIZE 256
#define WIDTH 512

main()
{
    float A[HEIGHT][SIZE];
    float B[SIZE][WIDTH];
    float R[HEIGHT][WIDTH];
    int i, j, k;











Runtime Environment 


It is often important to know something about the mode the processor is running in; the
main problem is that user-mode programs often cannot access the status registers that
hold this information. Two modes of interest to user programs are little- versus big-endian and
32- versus 64-bit mode. While the operating system may provide an API for discovering this
information, Example A.6 and Example A.7 illustrate methods of detecting these modes for
user-mode programs. Example A.6 detects 32- or 64-bit mode by taking advantage of
the fact that 0x00000000FFFFFFFF is positive in 64-bit mode and negative in 32-bit mode.
The detection of the endian mode shown in Example A.7 relies on the endian mode-specific
behavior of the load and store instructions.
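
Both ideas can be imitated in portable C. The sketch below is not the book's assembly code; the function names are hypothetical. The first function stores a word and reloads its first byte, and the second tests the sign bit of a value when it is viewed at a given register width.

```c
#include <stdint.h>
#include <string.h>

/* Store a word to memory and reload only its first (lowest-addressed)
   byte. On a little-endian machine that byte is the low-order byte. */
int is_little_endian(void)
{
    uint32_t word = 1;
    unsigned char first;

    memcpy(&first, &word, 1);
    return first == 1;
}

/* Nonzero if `value` is negative when interpreted as a two's complement
   number of `width` bits (width is 32 or 64 here). */
int is_negative_at_width(uint64_t value, unsigned width)
{
    return (int)((value >> (width - 1)) & 1u);
}
```

With value 0x00000000FFFFFFFF, the second function reports negative at width 32 and positive at width 64, which is the distinction the mode test relies on.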


64- or 32-Bit Mode 


EXAMPLE A.6.


64-Bit Integer Math on 32-Bit Machines


These functions illustrate 64-bit unsigned arithmetic for 32-bit machines. By far the most
complex operation is division, since it doesn’t easily break down into some combination of several
32-bit divides. Therefore, a shift-add method is used. Note the use of the condition register
logical instructions to check for special cases near the top of Example A.11. When evaluating
complex logical statements, you can either use the condition register logical instructions or
construct a series of branch conditionals. Which method is better varies from situation to
situation.


Notice that since a structure is returned by these functions, the caller implicitly passes the ad- 
dress for the return value as the first argument. 


Also, note in the shift routines in Example A.12 that no branches or comparisons are used due 
to the support for 64-bit shifts on 32-bit processors. 
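
To show why division needs a bit-by-bit method, here is a C sketch of 64-bit unsigned division built only from 32-bit operations. It is not the code of Example A.11: it uses the classic shift-and-subtract (restoring) scheme, the type U64 and all function names are hypothetical, and the divisor is assumed to be nonzero.

```c
#include <stdint.h>

/* A 64-bit unsigned value held as two 32-bit halves. */
typedef struct {
    uint32_t hi;
    uint32_t lo;
} U64;

static U64 u64_shl1(U64 a)              /* shift left by one bit */
{
    U64 r;
    r.hi = (a.hi << 1) | (a.lo >> 31);
    r.lo = a.lo << 1;
    return r;
}

static int u64_ge(U64 a, U64 b)         /* unsigned a >= b */
{
    return a.hi > b.hi || (a.hi == b.hi && a.lo >= b.lo);
}

static U64 u64_sub(U64 a, U64 b)        /* a - b, with borrow between halves */
{
    U64 r;
    r.lo = a.lo - b.lo;
    r.hi = a.hi - b.hi - (a.lo < b.lo);
    return r;
}

/* Shift-and-subtract division: one quotient bit per iteration.
   The divisor d must be nonzero. */
U64 u64_div(U64 n, U64 d, U64 *remainder)
{
    U64 q = { 0, 0 };
    U64 r = { 0, 0 };
    int bit;

    for (bit = 63; bit >= 0; bit--) {
        uint32_t next = (bit >= 32) ? (n.hi >> (bit - 32)) & 1u
                                    : (n.lo >> bit) & 1u;
        int carry = (int)(r.hi >> 31);  /* bit shifted out of the remainder */

        r = u64_shl1(r);
        r.lo |= next;
        if (carry || u64_ge(r, d)) {
            r = u64_sub(r, d);
            if (bit >= 32)
                q.hi |= 1u << (bit - 32);
            else
                q.lo |= 1u << bit;
        }
    }
    *remainder = r;
    return q;
}
```

The loop always runs 64 times; an assembly version can shorten it by checking for special cases first, as the text describes.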


Addition and Subtraction 


EXAMPLE A.8. 


Multiplication


EXAMPLE A.10. 





Division 


EXAMPLE A.11.


EXAMPLE A.12.










String Operations 


Example A.13 shows the code for a simple string hash. The hash function simply XORs each 
character into an unsigned word, looping through the word byte-by-byte. 
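
A C sketch of such a hash is shown below. It is an illustration rather than the code of Example A.13; the function name is hypothetical, and the choice to start at the low-order byte of the word is an assumption.

```c
#include <stdint.h>

/* XOR each character of a NUL-terminated string into a 32-bit word,
   cycling through the four byte positions of the word.
   Assumption: the first character goes into the low-order byte. */
uint32_t xor_hash(const char *s)
{
    uint32_t hash = 0;
    unsigned position = 0;              /* byte position 0..3 in the word */

    while (*s != '\0') {
        hash ^= (uint32_t)(unsigned char)*s << (8 * position);
        position = (position + 1) & 3;  /* wrap around after four bytes */
        s++;
    }
    return hash;
}
```

Because XOR is its own inverse, two identical characters landing in the same byte position cancel out, so this hash is simple but weak.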


The code for a Boyer-Moore type string search is shown in Example A.14. Notice that this 
example contains subfunctions, unlike most of the examples, and that the interfaces to these 
functions are tailored to the exact needs of their caller (and would not be callable from a com- 
piled language as they do not follow the subroutine calling conventions). This is one of the 
ways in which well-written assembly language code can outperform compiled languages—the 
programmer can use exact knowledge of the algorithm and the machine state to make the func- 
tion calls as efficient as possible. 
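
For readers who want the idea in a high-level language, the following C sketch implements the Horspool simplification of Boyer-Moore, which keeps only the bad-character shift table. It is not the code of Example A.14, and the function name is hypothetical.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool search. Returns a pointer to the first match of
   `pattern` in `text`, or NULL if there is none. */
const char *horspool_search(const char *text, const char *pattern)
{
    size_t n = strlen(text);
    size_t m = strlen(pattern);
    size_t shift[UCHAR_MAX + 1];
    size_t i, pos;

    if (m == 0)
        return text;
    if (m > n)
        return NULL;

    /* Default shift is the whole pattern length; characters that occur in
       the pattern (except its last position) allow a smaller shift. */
    for (i = 0; i <= UCHAR_MAX; i++)
        shift[i] = m;
    for (i = 0; i + 1 < m; i++)
        shift[(unsigned char)pattern[i]] = m - 1 - i;

    pos = 0;
    while (pos + m <= n) {
        if (memcmp(text + pos, pattern, m) == 0)
            return text + pos;
        /* Skip ahead based on the text character under the pattern's end. */
        pos += shift[(unsigned char)text[pos + m - 1]];
    }
    return NULL;
}
```

The speed of this family of algorithms comes from the shift table: most alignments of the pattern are skipped without being examined.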


String Hash


EXAMPLE A.13.


Boyer-Moore String Search


EXAMPLE A.14.


Matrix Multiplication


Example A.15 shows the code for a matrix multiply function. The matrices consist of single- 
precision floating-point numbers and may be any size. 
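
A plain C reference version is useful for checking the results of an assembly routine. The sketch below is such a reference, not the code of Example A.15. The name mxmul_ref is hypothetical, and the row-major layout with A as h rows by n columns, B as n rows by w columns, and R as h rows by w columns is an assumption.

```c
#include <stddef.h>

/* Reference matrix multiply, R = A * B, on single-precision values stored
   in row-major order (assumed layout: A is h x n, B is n x w, R is h x w). */
void mxmul_ref(size_t h, size_t n, size_t w,
               const float *a, const float *b, float *r)
{
    size_t i, j, k;

    for (i = 0; i < h; i++) {
        for (j = 0; j < w; j++) {
            float acc = 0.0f;
            /* Each step is a multiply followed by an add. */
            for (k = 0; k < n; k++)
                acc += a[i * n + k] * b[k * w + j];
            r[i * w + j] = acc;
        }
    }
}
```

The inner loop is a chain of multiply and add steps, which is the pattern the multiply-add instructions are designed for.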


EXAMPLE A.15. 


In this section, we give detailed descriptions of each instruction in the PowerPC instruction
set. The notation used throughout this chapter is consistent with the rest of the book. (See
Table B.1.)


Table B.1. Notations used in this book.


Notation       Description

(RT)           Register reference. This means the contents of RT. Thus, (R2) means
               the data contained in register R2.
[x]            Memory reference. This means the memory location addressed by the
               value x.
←, ⇐           Assignment statements. The object on the left of the assignment
               symbol is given the value of the object on the right side of the symbol.
<<             Shift left. A<<B means that A is shifted left by B bits.
>>             Shift right. A>>B means that A is shifted right by B bits.
<              Signed less than. This means less than and uses a signed comparison.
               Thus, 0xFFFF < 0x0000 (-1 < 0).
u<             Unsigned less than. This means less than and uses an unsigned
               comparison. Thus, 0x0000 u< 0xFFFF (0 u< 65,535).
>              Signed greater than. This means greater than and uses a signed
               comparison. Thus, 0x0000 > 0xFFFF (0 > -1).
u>             Unsigned greater than. This means greater than and uses an unsigned
               comparison. Thus, 0xFFFF u> 0x0000 (65,535 u> 0).
==             Equal to. Thus, 0x0000 == 0x0000 (0 == 0).
!=             Not equal to. Thus, 0x0000 != 0xFFFF (0 != -1).
||             Concatenate. Thus, 0xF || 0xD means 0xFD.
&              Boolean AND. 0xF55F & 0x5FAF = 0x550F.
|              Boolean OR. 0xF55F | 0x5FAF = 0xFFFF.
¬              Boolean NOT. ¬A means not A. So A | ¬A = 0b1 and A & ¬A = 0b0.
⊕              Boolean XOR. Thus, A ⊕ ¬A = 0b1 and A ⊕ A = 0b0.
≡              Boolean equivalence. Thus, A ≡ ¬A = 0b0 and A ≡ A = 0b1.
sign_ext(x)    Sign extend. sign_ext(0xFF) = 0xFFFF, and sign_ext(0x7F) = 0x007F.
Rn_{0-8}       Subfield. Subscripts used in this fashion mean a subfield of the contents
               of Rn. Thus, Rn_{0-8} means bits 0 through 8 of the register Rn.
^{a}A          Repeat. This is a repeat symbol. ^{a}A means to concatenate a copies of A.
+              Addition. 0x0001 + 0xF001 = 0xF002.
-              Subtraction. 0x0444 - 0x0434 = 0x0010.
*              Multiplication. 0x0002 * 0x0004 = 0x0008.
%              Modulo. This means modulo or remainder. A%B means the remainder
               if you divide A by B.
/              Divide. A/B means A divided by B.
?:             Conditional operator. An if then else structure. The statement A?B:C
               takes the value of B if A evaluates to true; otherwise, it takes the value
               of C.


Throughout this reference, we use bit values associated with the 32-bit PowerPC architecture 
except for those instructions that are only available on 64-bit implementations. When no bit 
numbering is given, either a note is made for 64-bit implementations showing the appropriate 
bit numbering, or the 64-bit operation only operates on the low-order 32-bits (bits 32-63) of 
the operands. In the latter case, the bit numbering for 64-bit implementations can be obtained 
by adding 32 to each bit number. 


Fields which are reserved are marked with a forward slash (/). These fields should always be set 
to 0. 


add[o][.]


integer add instruction


Encoding (bit positions): 0-5: 31 | 6-10: RT | 11-15: RA | 16-20: RB | 21: OE | 22-30: 266 | 31: Rc


add[o][.] RT, RA, RB
RT ← (RA) + (RB)


The add instruction stores the sum of the contents of RA and RB into RT.


If the addo[.] form of the instruction is used, then the overflow and summary overflow bits of
the XER are set if overflow occurs. The overflow bit is cleared if overflow does not occur.


If the add[o]. form of the instruction is used, then CR0 is updated.
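
The overflow rule can be modeled in C. This sketch is a software model for illustration, not a definition of the hardware: the type Xer and the function addo_model are hypothetical names, and registers are assumed to be 32 bits wide. Signed overflow occurs when both operands have the same sign and the sum has the opposite sign.

```c
#include <stdint.h>

/* Hypothetical model of the two XER bits affected by addo. */
typedef struct {
    int ov;   /* overflow: set or cleared by every addo */
    int so;   /* summary overflow: set on overflow and left set afterward */
} Xer;

/* Model of addo on 32-bit registers. Operands are passed as raw bit
   patterns; the result is the 32-bit sum. */
uint32_t addo_model(uint32_t a, uint32_t b, Xer *xer)
{
    uint32_t sum = a + b;
    /* Overflow if a and b agree in sign but the sum's sign differs. */
    int overflow = (int)((~(a ^ b) & (a ^ sum)) >> 31);

    xer->ov = overflow;
    if (overflow)
        xer->so = 1;
    return sum;
}
```

Because the summary bit is only ever set here, a program can run a sequence of such additions and test for overflow once at the end.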


addc[o][.]
integer add carrying instruction


addc[o][.] RT, RA, RB


RT ← (RA) + (RB)


The addc instruction stores the sum of the contents of RA and RB into RT. The XER_{CA} bit
is set to the carry out of the addition.


If the addco[.] form of the instruction is used, then the overflow and summary overflow bits of
the XER are set if overflow occurs. The overflow bit of the XER is cleared if overflow does not
occur.


If the addc[o]. form of the instruction is used, then CR0 is updated.


adde[o][.]
integer add extended


Encoding (bit positions): 0-5: 31 | 6-10: RT | 11-15: RA | 16-20: RB | 21: OE | 22-30: 138 | 31: Rc


adde[o][.] RT, RA, RB


RT ← (RA) + (RB) + XER_{CA}


The add extended instruction performs an add with the carry bit in the XER as an implicit
operand. Thus RT is loaded with the sum of RA, RB, and the carry bit in the XER. The XER_{CA}
bit is set by the carry out of this addition.


If the addeo[.] form of the instruction is used, then the overflow and summary overflow bits of
the XER are set if overflow occurs. The overflow bit of the XER is cleared if overflow does not
occur.


If the adde[o]. form of the instruction is used, then CR0 is updated.
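
The usual reason to pair an add carrying with an add extended is multiword arithmetic. The C sketch below models that pairing for a 64-bit addition on 32-bit registers. The function names are hypothetical, and the carry bit is modeled as an ordinary variable.

```c
#include <stdint.h>

/* Model of addc: add two words and record the carry out. */
static uint32_t addc_model(uint32_t a, uint32_t b, unsigned *ca)
{
    uint32_t sum = a + b;
    *ca = (sum < a);                    /* wrapped around, so a carry occurred */
    return sum;
}

/* Model of adde: add two words plus the incoming carry, then record
   the carry out of this addition. */
static uint32_t adde_model(uint32_t a, uint32_t b, unsigned *ca)
{
    unsigned carry_in = *ca;
    uint32_t sum = a + b + carry_in;
    *ca = (sum < a) || (carry_in && sum == a);
    return sum;
}

/* 64-bit addition from two 32-bit steps: low words first, then high
   words with the carry produced by the low-word addition. */
void add64_model(uint32_t a_hi, uint32_t a_lo, uint32_t b_hi, uint32_t b_lo,
                 uint32_t *r_hi, uint32_t *r_lo)
{
    unsigned ca = 0;
    *r_lo = addc_model(a_lo, b_lo, &ca);
    *r_hi = adde_model(a_hi, b_hi, &ca);
}
```

The same chain extends to any number of words by repeating the add extended step for each higher word.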


addi


integer add immediate


addi RT, RA, si


RT ← (RA) + (^{16}si_{16} || si_{16-31})


The add immediate instruction performs an add of RA to the sign-extended 16-bit immediate
field, placing the result into RT.


If the addi instruction is used with RA = 0, then 0 is used instead of the contents
of RA. Thus regardless of the contents of r0, addi r1, r0, 4 would load the constant 4 into r1
(see the li instruction).

addic[.]

integer add immediate carrying


Encoding (bit positions): 0-5: opcode | 6-10: RT | 11-15: RA | 16-31: si


addic[.] RT, RA, si


RT ← (RA) + (^{16}si_{16} || si_{16-31})


The add immediate carrying instruction performs an add of RA to the sign-extended 16-bit
immediate field, placing the result into RT. The XER_{CA} bit is updated with the carry out of
the addition.


If the addic. form of the instruction is used, then CR0 is updated.


addis
integer add immediate shifted


Encoding (bit positions): 0-5: 15 | 6-10: RT | 11-15: RA | 16-31: si


addis RT, RA, si
RT ← (RA) + (si || ^{16}0)


The add immediate shifted instruction performs an addition of RA to the immediate field left
shifted by 16 bits (that is, the immediate field with 16 zeros concatenated to the right), placing
the result into RT.


If the addis instruction is used with RA = 0, then 0 is used instead of the contents of RA. Thus
regardless of the contents of r0, addis r1, r0, 4 would load the constant 4 || ^{16}0 = 262,144
into r1 (see the lis instruction).
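
A common use of this pair of instructions is building a full 32-bit constant: add immediate shifted supplies the upper half and add immediate supplies the lower half. Because the lower half is sign extended, the upper half must be adjusted when bit 16 of the immediate is set. The C sketch below models this on 32-bit registers; all function names are hypothetical.

```c
#include <stdint.h>

/* Sign-extend a 16-bit immediate to 32 bits. */
static uint32_t sign_ext16(uint16_t imm)
{
    return (imm & 0x8000u) ? ((uint32_t)imm | 0xFFFF0000u) : (uint32_t)imm;
}

/* Model of addi: register number 0 in the RA position means the value 0. */
uint32_t addi_model(const uint32_t gpr[32], unsigned ra, uint16_t imm)
{
    uint32_t base = (ra == 0) ? 0u : gpr[ra];
    return base + sign_ext16(imm);
}

/* Model of addis: the immediate is shifted left by 16 bits. */
uint32_t addis_model(const uint32_t gpr[32], unsigned ra, uint16_t imm)
{
    uint32_t base = (ra == 0) ? 0u : gpr[ra];
    return base + ((uint32_t)imm << 16);
}

/* Split a 32-bit constant into immediates for addis followed by addi.
   The high part compensates for the sign extension of the low part. */
void split_constant(uint32_t value, uint16_t *high, uint16_t *low)
{
    *low = (uint16_t)(value & 0xFFFFu);
    *high = (uint16_t)((value - sign_ext16(*low)) >> 16);
}
```

For the constant 0x12348765 the low half is 0x8765, which sign extends to a negative value, so the high half becomes 0x1235 rather than 0x1234.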


addme[o][.]


integer add extended to minus one


addme[o][.] RT, RA


RT ← (RA) + XER_{CA} - 1


The add extended to minus one instruction performs an add with the carry bit in the XER as
an implicit operand. Thus RT is loaded with the sum of RA, -1, and XER_{CA}. The XER_{CA} bit
is set by the carry out of this addition.


If the addmeo[.] form of the instruction is used, then the overflow and summary overflow bits
of the XER are set if overflow occurs. The overflow bit of the XER is cleared if overflow does
not occur.


If the addme[o]. form of the instruction is used, then CR0 is updated.


addze[o][.]
integer add extended to zero


addze[o][.] RT, RA


RT ← (RA) + XER_{CA}


The add extended to zero instruction performs an add with the carry bit in the XER as an implicit
operand. Thus RT is loaded with the sum of RA and XER_{CA}. The XER_{CA} bit is set by the
carry out of this addition.


If the addzeo[.] form of the instruction is used, then the overflow and summary overflow bits
of the XER are set if overflow occurs. The overflow bit of the XER is cleared if overflow does
not occur.


If the addze[o]. form of the instruction is used, then CR0 is updated.


and[.]

logical AND


Encoding (bit positions): 0-5: 31 | 6-10: RA | 11-15: RT | 16-20: RB | 21-30: 28 | 31: Rc


and[.] RT, RA, RB


RT ← (RA) & (RB)


The and[.] instruction performs a logical AND of register RA with register RB. The result of
this is then stored into register RT.


If the Rc bit is set (and.), then CR0 is updated.


andc[.]
logical AND with complement


Encoding (bit positions): 0-5: 31 | 6-10: RA | 11-15: RT | 16-20: RB | 21-30: 60 | 31: Rc


andc[.] RT, RA, RB


RT ← (RA) & ¬(RB)


The andc[.] instruction performs a logical AND of register RA with the logical negation of
register RB. The result of this is then stored into register RT.


If the Rc bit is set (andc.), then CR0 is updated.




andi.


logical AND immediate


Encoding (bit positions): 0-5: 28 | 6-10: RA | 11-15: RT | 16-31: ui


andi. RT, RA, ui


RT ← (RA) & (^{16}0 || ui)


The andi. instruction performs a logical AND of the low-order 16 bits of the contents of
register RA with the immediate field ui. RA itself is unchanged, and the high-order 16 bits of
the result are zero. The result is then stored into register RT.


For this instruction, CR0 is always updated.


andis.
logical AND shifted immediate


Encoding (bit positions): 0-5: 29 | 6-10: RA | 11-15: RT | 16-31: ui


andis. RT, RA, ui
RT ← (RA) & (ui || ^{16}0)


The andis. instruction performs a logical AND of register RA with the immediate field ui
shifted left by 16 bits. The result of this is then stored into register RT.


For this instruction, CR0 is always updated.


b[l][a]


unconditional branch
b[l][a] target_addr
b[l]a: PC ← (^{6}LI_{6} || LI_{6-29} || ^{2}0)


b[l]: PC ← PC + (^{6}LI_{6} || LI_{6-29} || ^{2}0)


The unconditional branch instruction transfers control to the instruction at the
target address (target_addr). Note that target_addr must be divisible by four (that is, word
aligned).


The absolute unconditional branch (b[l]a) instruction uses the sign-extended immediate field
of the instruction (LI) with two binary zeros appended to it as the target address.


The relative unconditional branch (b[l]) instruction uses the current value of the program
counter (PC) added to the sign-extended immediate field of the instruction (LI) with two binary
zeros appended to it as the target address.


If the bl[a] form of the instruction is used, then the link register is loaded with the old value of
the PC added to 4 (that is, the address of the branch, plus four).
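
The target calculation can be written out in C. This sketch decodes a 32-bit instruction word under the assumption that the LI field occupies bits 6 through 29 and the absolute-address (AA) flag is bit 30, counting from the most significant bit as bit 0. The function names are hypothetical.

```c
#include <stdint.h>

/* Compute the target of an unconditional branch.
   Assumed layout: LI in bits 6-29, AA in bit 30, link flag in bit 31,
   with bit 0 being the most significant bit of the instruction word. */
uint32_t branch_target(uint32_t instruction, uint32_t pc)
{
    /* Keep LI with the two low-order zero bits already appended. */
    uint32_t displacement = instruction & 0x03FFFFFCu;
    int absolute = (int)((instruction >> 1) & 1u);

    /* Sign-extend: replicate the top bit of LI into the upper six bits. */
    if (displacement & 0x02000000u)
        displacement |= 0xFC000000u;

    return absolute ? displacement : pc + displacement;
}

/* Value placed in the link register by the linking forms. */
uint32_t link_value(uint32_t pc)
{
    return pc + 4;
}
```

A relative branch can therefore reach 32 megabytes in either direction from the branch itself.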


bc[l][a]
conditional branch


Encoding (bit positions): 0-5: 16 | 6-10: BO | 11-15: BI | 16-29: BD | 30: AA | 31: LK


bc[l][a] BO, BI, target_addr
bc[l]a: PC ← (^{16}BD_{16} || BD_{16-29} || ^{2}0) iff condition


bc[l]: PC ← PC + (^{16}BD_{16} || BD_{16-29} || ^{2}0) iff condition


The conditional branch instruction may transfer control to the instruction at the
target address (target_addr). Note that target_addr must be divisible by four (that is, word
aligned). Control is transferred only if some condition is met. The BO field defines the
condition which must be met (see Table B.2). There are many extended mnemonics associated with
this instruction which are used to specify the BO and BI fields implicitly (see Tables B.3
and B.4).


The absolute conditional branch (bc[l]a) instruction uses the sign-extended immediate field of
the instruction (BD) with two binary zeros appended to it as the target address.


The relative conditional branch (bc[l]) instruction uses the current value of the program counter
(PC) added to the sign-extended immediate field of the instruction (BD) with two binary zeros
appended to it as the target address.


If the bcl[a] form of the instruction is used, then the link register is loaded with the old value
of the PC added to 4 (that is, the address of the branch, plus four).


Table B.2. BO field encodings for branch conditional, branch conditional to count, and branch
conditional to link instructions.


encoding (a)    definition

0b0000y (b)     Decrement the count register; then branch if the new value in the
                count register is not 0, and the condition (c) is false.
0b0001y         Decrement the count register; then branch if the new value in the
                count register is 0, and the condition is false.
0b0010y         Branch if the condition is false.
0b0100y         Decrement the count register; then branch if the new value in the
                count register is not 0, and the condition is true.
0b0101y         Decrement the count register; then branch if the new value in the
                count register is 0, and the condition is true.
0b0110y         Branch if the condition is true.
0b1000y         Decrement the count register; then branch if the new value in the
                count register is not 0.
0b1001y         Decrement the count register; then branch if the new value in the
                count register is 0.
0b10100         Branch always.


a. These are the only valid BO field encodings.
b. The y-bit is the bit which reverses the default branch prediction.
c. The condition is determined by the bit in the condition register specified in the BI field. A
value of 0 is a false condition while a value of one is a true condition.
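
The table can be read as four independent control bits plus the prediction bit. The C sketch below is a software model of that reading, with hypothetical names; BO is passed as a 5-bit value whose most significant bit is the leftmost digit of the encodings in Table B.2, and cr_bit is the condition register bit selected by BI (0 or 1).

```c
#include <stdint.h>

/* Decide whether a conditional branch is taken.
   bo      5-bit BO field (0x10 is the leftmost bit of the encoding)
   cr_bit  value (0 or 1) of the condition register bit selected by BI
   ctr     count register; decremented when the encoding asks for it */
int bc_taken(unsigned bo, int cr_bit, uint32_t *ctr)
{
    int ctr_ok = 1;
    int cond_ok;

    if (!(bo & 0x04)) {                 /* this encoding decrements the count */
        *ctr -= 1;
        /* Bit 0x02 selects "branch if count is 0" over "count is not 0". */
        ctr_ok = ((*ctr != 0) != ((bo & 0x02) != 0));
    }

    /* Bit 0x10 means the condition is ignored; otherwise bit 0x08 gives
       the value the condition bit must have. */
    cond_ok = (bo & 0x10) || (cr_bit == (int)((bo >> 3) & 1u));

    return ctr_ok && cond_ok;           /* bit 0x01 (y) affects only prediction */
}
```

For example, the encoding 0b10000 gives the familiar loop-closing branch: it decrements the count and branches while the count is nonzero.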


Table B.3. Extended mnemonics for the different BO field encodings.


                                                Target address type
Branch                                (bc)        (bca)        (bclr)        (bcctr)
Semantics (X) (a)                     relative    absolute     to link       to count

branch unconditionally                -           -            blr[l]        bctr[l]
branch if condition true (t)          bt[l]       bt[l]a       btlr[l]       btctr[l]
branch if condition false (f)         bf[l]       bf[l]a       bflr[l]       bfctr[l]
decrement count and branch if         bdnz[l]     bdnz[l]a     bdnzlr[l]     -
  count non-zero (dnz)
decrement count and branch if         bdz[l]      bdz[l]a      bdzlr[l]      -
  count zero (dz)
decrement count and branch if         bdnzt[l]    bdnzt[l]a    bdnztlr[l]    -
  count non-zero and condition
  true (dnzt)
decrement count and branch if         bdnzf[l]    bdnzf[l]a    bdnzflr[l]    -
  count non-zero and condition
  false (dnzf)
decrement count and branch if         bdzt[l]     bdzt[l]a     bdztlr[l]     -
  count zero and condition
  true (dzt)
decrement count and branch if         bdzf[l]     bdzf[l]a     bdzflr[l]     -
  count zero and condition
  false (dzf)


a. The building block which is used to form the mnemonic is shown in parentheses beside the
semantic description. The mnemonic for the branch is built out of three components:
b[CR code][target code][l] or b[CR code][l][target code].


Table B.4. Extended mnemonics for condition register bit encodings.


                                                Target address type
Branch                                (bc)        (bca)        (bclr)        (bcctr)
Semantics (X) (a)                     relative    absolute     to link       to count

branch if less than (lt)              blt[l]      blt[l]a      bltlr[l]      bltctr[l]
branch if less than or                ble[l]      ble[l]a      blelr[l]      blectr[l]
  equal to (le)
branch if greater than or             bge[l]      bge[l]a      bgelr[l]      bgectr[l]
  equal to (ge)
branch if greater than (gt)           bgt[l]      bgt[l]a      bgtlr[l]      bgtctr[l]
branch if not less than (nl)          bnl[l]      bnl[l]a      bnllr[l]      bnlctr[l]
branch if not equal to (ne)           bne[l]      bne[l]a      bnelr[l]      bnectr[l]
branch if not greater than (ng)       bng[l]      bng[l]a      bnglr[l]      bngctr[l]
branch if summary overflow (so)       bso[l]      bso[l]a      bsolr[l]      bsoctr[l]
branch if not summary                 bns[l]      bns[l]a      bnslr[l]      bnsctr[l]
  overflow (ns)
branch if unordered (see              bun[l]      bun[l]a      bunlr[l]      bunctr[l]
  floating-point compare
  instructions below) (un)
branch if not unordered (see          bnu[l]      bnu[l]a      bnulr[l]      bnuctr[l]
  floating-point compare
  instructions below) (nu)


a. The building block which is used to form the mnemonic is shown in parentheses beside the
semantic description. The mnemonic for the branch is built out of three components:
b[CR code][target code][l].


bcctr[l]


conditional branch to count register


Encoding (bit positions): 0-5: 19 | 6-10: BO | 11-15: BI | 16-20: / | 21-30: 528 | 31: LK


bcctr[l] BO, BI


bcctr[l]: PC ← CTR_{0-29} || ^{2}0 iff condition


The conditional branch to count register instruction may transfer control to the instruction at
the address contained in the count register. Control is transferred only if some condition is met.
The BO field defines the condition which must be met (see Table B.2). There are many
extended mnemonics associated with this instruction which are used to specify the BO and BI
fields implicitly (see Tables B.3 and B.4). Note that settings of the BO field which correspond
to decrement-count-type branches are not valid with this instruction (see Tables B.2 and B.3).


If the bcctrl form of the instruction is used, then the link register is loaded with the old value
of the PC added to 4 (that is, the address of the branch, plus four).


bclr[l]
conditional branch to link register


bclr[l] BO, BI


bclr[l]: $PC \leftarrow LR_{0:29} \| {}^{2}0$ iff condition


The conditional branch to link register instruction may transfer control to the instruction at 
the address contained in the link register. Control is transferred only if some condition is met.
The BO field defines the condition which must be met (see Table B.2). There are many ex- 
tended mnemonics associated with this instruction which are used to specify the BO and BI 
fields implicitly (see Tables B.3 and B.4). 


If the bclrl form of the instruction is used, then the link register is loaded with the old value of
the PC added to 4 (that is, the address of the branch plus four).
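The target and link behaviour of bclr[l] and bcctr[l] can be sketched in Python. This is only an illustration under stated assumptions: the function name is hypothetical, a 32-bit implementation is assumed, and the BO/BI test is taken as an already evaluated boolean rather than decoded.

```python
MASK32 = 0xFFFFFFFF

def branch_to_register(pc, target_reg, condition, link=False, lr=0):
    """Return (new_pc, new_lr) for a bclr/bcctr-style branch (32-bit model).

    target_reg is the old LR value (bclr) or the CTR value (bcctr).
    condition is the result of the BO/BI test, which is not decoded here.
    """
    return_address = (pc + 4) & MASK32
    # Bits 0:29 of the register are used; the two low-order bits are forced to 0.
    new_pc = (target_reg & ~0x3 & MASK32) if condition else return_address
    # The "l" forms load the link register with the old PC plus four.
    new_lr = return_address if link else lr
    return new_pc, new_lr
```

For example, `branch_to_register(0x1000, 0x2003, True)` yields a new PC of 0x2000, because the low two bits of the register are dropped.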


clrldi[.]


clear left doubleword immediate


clrldi[.] RT, RA, n (n < 64)


$RT \leftarrow {}^{n}0 \| (RA)_{n:63}$


The clrldi[.] instruction clears the n leftmost bits of RA and places the result in RT. This is an
extended mnemonic for the rotate left doubleword immediate and clear left (rldicl[.]) instruction:


rldicl[.] RT, RA, 0, n

If the record bit is set (clrldi.), then CR0 is updated.


This instruction is only defined for 64-bit implementations. 


clrlsldi[.]
clear left and shift left doubleword immediate


clrlsldi[.] RT, RA, b, n (n ≤ b < 64)


$RT \leftarrow {}^{b-n}0 \| (RA)_{b:63} \| {}^{n}0$


The clrlsldi[.] instruction clears the b leftmost bits of RA, then left shifts that by n bits and
places the result in RT. This is an extended mnemonic for the rotate left doubleword immediate
and clear (rldic[.]) instruction:


rldic[.] RT, RA, n, b-n

If the record bit is set (clrlsldi.), then CR0 is updated.
This instruction is only defined for 64-bit implementations. 


clrlslwi[.]
clear left and shift left word immediate


clrlslwi[.] RT, RA, b, n (n ≤ b < 32)


$RT \leftarrow {}^{b-n}0 \| (RA)_{b:31} \| {}^{n}0$


The clrlslwi[.] instruction clears the b leftmost bits of RA, then left shifts that by n bits and
places the result in RT. This is an extended mnemonic for the rotate left word immediate then
AND with mask (rlwinm[.]) instruction:


rlwinm[.] RT, RA, n, b-n, 31-n

If the record bit is set (clrlslwi.), then CR0 is updated.
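The equivalence between clrlslwi and its rlwinm expansion can be checked with a small Python model. This is a sketch, not a definitive implementation: the helper names are hypothetical, registers are modelled as 32-bit integers, and bit 0 is taken as the most significant bit, as in the formulas above.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32

def mask32(mb, me):
    """Ones in bits mb..me (bit 0 is the most significant bit)."""
    head = MASK32 >> mb                     # ones from bit mb to bit 31
    tail = (MASK32 << (31 - me)) & MASK32   # ones from bit 0 to bit me
    return head & tail if mb <= me else head | tail

def rlwinm(ra, sh, mb, me):
    """Rotate left word immediate then AND with mask."""
    return rotl32(ra, sh) & mask32(mb, me)

def clrlslwi(ra, b, n):
    """Clear the b leftmost bits, then shift left by n (n <= b < 32)."""
    cleared = ra & (MASK32 >> b)
    return (cleared << n) & MASK32
```

With these definitions, `clrlslwi(x, b, n)` and `rlwinm(x, n, b - n, 31 - n)` agree for every legal pair of b and n.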


clrlwi[.]
clear left word immediate


clrlwi[.] RT, RA, n (n < 32)

$RT \leftarrow {}^{n}0 \| (RA)_{n:31}$

The clrlwi[.] instruction clears the n leftmost bits of RA and places the result in RT. This is an
extended mnemonic for the rotate left word immediate then AND with mask (rlwinm[.]) instruction:


rlwinm[.] RT, RA, 0, n, 31

If the record bit is set (clrlwi.), then CR0 is updated.
clrrdi[.] 


clear right doubleword immediate 


clrrdi[.] RT, RA, n (n < 64)


$RT \leftarrow (RA)_{0:63-n} \| {}^{n}0$


The clrrdi[.] instruction clears the n rightmost bits of RA and places the result in RT. This is
an extended mnemonic for the rotate left doubleword immediate and clear right (rldicr[.])
instruction:


rldicr[.] RT, RA, 0, 63-n

If the record bit is set (clrrdi.), then CR0 is updated.
This instruction is only defined for 64-bit implementations. 


clrrwi[.] 
clear right word immediate 


[0:5] 21 | [6:10] RA | [11:15] RT | [16:20] 0 | [21:25] 0 | [26:30] 31-n | [31] Rc


clrrwi[.] RT, RA, n (n < 32)


$RT \leftarrow (RA)_{0:31-n} \| {}^{n}0$


The clrrwi[.] instruction clears the n rightmost bits of RA and places the result in RT. This is
an extended mnemonic for the rotate left word immediate then AND with mask (rlwinm[.])
instruction:


rlwinm[.] RT, RA, 0, 0, 31-n

If the record bit is set (clrrwi.), then CR0 is updated.
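As a brief illustration, clrlwi and clrrwi reduce to plain AND masks. The following Python sketch assumes 32-bit registers held in integers; the function names simply mirror the mnemonics and are not an official API.

```python
MASK32 = 0xFFFFFFFF

def clrlwi(ra, n):
    """Clear the n leftmost (most significant) bits of a 32-bit value."""
    return ra & (MASK32 >> n)

def clrrwi(ra, n):
    """Clear the n rightmost (least significant) bits of a 32-bit value."""
    return ra & (MASK32 << n) & MASK32

# A common use of clrrwi is rounding an address down to a power-of-two
# boundary, here 16 bytes (n = 4).
aligned = clrrwi(0x0001237F, 4)
```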


cmp 
integer compare 





cmp BF, L, RA, RB 





$CR_{BF} \leftarrow (RA) < (RB) \| (RA) > (RB) \| (RA) = (RB) \| XER_{SO}$


The cmp instruction performs a signed word or signed doubleword comparison of the contents
of RA and RB. A word compare is done if the L-bit is a 0, while a doubleword compare
is performed if the L-bit is a 1. The doubleword compare form is only defined for 64-bit
implementations.
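The four-bit result that cmp writes into the selected CR field can be modelled as follows. This is a minimal sketch: the function names are hypothetical, and the register values are passed as unsigned Python integers that the model reinterprets as signed.

```python
def to_signed(value, bits):
    """Reinterpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def cmp_field(ra, rb, l_bit=0, xer_so=0):
    """Return the (LT, GT, EQ, SO) bits that cmp places in CR field BF."""
    bits = 64 if l_bit else 32          # L = 0: word, L = 1: doubleword
    a, b = to_signed(ra, bits), to_signed(rb, bits)
    return (int(a < b), int(a > b), int(a == b), xer_so)
```

Note that a word compare (L = 0) looks only at the low-order 32 bits, so 0x100000005 and 5 compare equal, while the doubleword form sees them as different.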


cmpd 


compare doubleword 


[0:5] 31 | [6:8] BF | [9] / | [10] 1 | [11:15] RA | [16:20] RB | [21:30] 0 | [31] /


cmpd BF, RA, RB


$CR_{BF} \leftarrow (RA) < (RB) \| (RA) > (RB) \| (RA) = (RB) \| XER_{SO}$


The cmpd instruction performs a signed doubleword compare of the contents of RA and RB. 
This is an extended mnemonic for the compare (cmp) instruction: 


cmp BF, 1, RA, RB 
It is only defined for 64-bit implementations. 


cmpdi 


compare doubleword immediate 


[0:5] 11 | [6:8] BF | [9] / | [10] 1 | [11:15] RA | [16:31] si


cmpdi BF, RA, si


$CR_{BF} \leftarrow (RA) < si \| (RA) > si \| (RA) = si \| XER_{SO}$


The cmpdi instruction performs a signed doubleword compare of the contents of RA to the
sign extended value si. This is an extended mnemonic for the compare immediate (cmpi)
instruction:


cmpi BF, 1, RA, si
It is only defined for 64-bit implementations. 


cmpld 
compare logical (unsigned) doubleword 





cmpld BF, RA, RB 


$CR_{BF} \leftarrow (RA) < (RB) \| (RA) > (RB) \| (RA) = (RB) \| XER_{SO}$


The cmpld instruction performs an unsigned doubleword compare of the contents of RA and
RB. This is an extended mnemonic for the compare logical (cmpl) instruction:


cmpl BF, 1, RA, RB
It is only defined for 64-bit implementations. 


cmpldi 
compare logical (unsigned) doubleword immediate 


[0:5] 10 | [6:8] BF | [9] / | [10] 1 | [11:15] RA | [16:31] ui


cmpldi BF, RA, ui


$CR_{BF} \leftarrow (RA) < ui \| (RA) > ui \| (RA) = ui \| XER_{SO}$


The cmpldi instruction performs an unsigned doubleword compare of the contents of RA
to the value ui. This is an extended mnemonic for the compare logical immediate (cmpli)
instruction:


cmpli BF, 1, RA, ui
It is only defined for 64-bit implementations. 


cmpi 
integer compare immediate 


[0:5] 11 | [6:8] BF | [9] / | [10] L | [11:15] RA | [16:31] si


cmpi BF, L, RA, si


$CR_{BF} \leftarrow (RA) < si \| (RA) > si \| (RA) = si \| XER_{SO}$


The compare immediate (cmpi) instruction performs a signed word or signed doubleword 
comparison of the contents of RA to the sign extended value si. A word compare is done if the 
L-bit is a 0, while a doubleword compare is performed if the L-bit is a 1. The doubleword 
compare form is only defined for 64-bit implementations. 

cmpl 

compare logical (unsigned) 


cmpl BF, L, RA, RB 








$CR_{BF} \leftarrow (RA) < (RB) \| (RA) > (RB) \| (RA) = (RB) \| XER_{SO}$


The compare logical (cmpl) instruction performs an unsigned word or unsigned doubleword
comparison of the contents of RA and RB. A word compare is done if the L-bit is a 0, while a
doubleword compare is performed if the L-bit is a 1. The doubleword compare form is only
defined for 64-bit implementations.
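The only difference from the signed compare is how the operand bits are interpreted, which a short Python sketch makes visible. The function name is hypothetical and the model covers only the CR field result.

```python
def cmpl_field(ra, rb, l_bit=0, xer_so=0):
    """Return the (LT, GT, EQ, SO) bits produced by an unsigned compare."""
    width_mask = (1 << (64 if l_bit else 32)) - 1
    a, b = ra & width_mask, rb & width_mask   # operands treated as unsigned
    return (int(a < b), int(a > b), int(a == b), xer_so)
```

For instance, the word 0xFFFFFFFF is greater than 1 under cmpl, whereas a signed compare treats the same bit pattern as -1 and reports "less than".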


cmpli 
compare logical (unsigned) immediate 


[0:5] 10 | [6:8] BF | [9] / | [10] L | [11:15] RA | [16:31] ui


cmpli BF, L, RA, ui


$CR_{BF} \leftarrow (RA) < ui \| (RA) > ui \| (RA) = ui \| XER_{SO}$


The compare logical immediate (cmpli) instruction performs an unsigned word or unsigned
doubleword comparison of the contents of RA to the value ui. A word compare is done if the 
L-bit is a 0, while a doubleword compare is performed if the L-bit is a 1. The doubleword 
compare form is only defined for 64-bit implementations. 


cmpw 
compare word 


[0:5] 31 | [6:8] BF | [9] / | [10] 0 | [11:15] RA | [16:20] RB | [21:30] 0 | [31] /


cmpw BF, RA, RB

$CR_{BF} \leftarrow (RA) < (RB) \| (RA) > (RB) \| (RA) = (RB) \| XER_{SO}$


The cmpw instruction performs a signed word compare of the contents of RA and RB. This is 
an extended mnemonic for the compare (cmp) instruction: 


cmp BF, 0, RA, RB 


cmpwi 
compare word immediate 


[0:5] 11 | [6:8] BF | [9] / | [10] 0 | [11:15] RA | [16:31] si


cmpwi BF, RA, si


$CR_{BF} \leftarrow (RA) < si \| (RA) > si \| (RA) = si \| XER_{SO}$


The cmpwi instruction performs a signed word compare of the contents of RA to the sign
extended value si. This is an extended mnemonic for the compare immediate (cmpi) instruction:


cmpi BF, 0, RA, si


cmplw


compare logical (unsigned) word 


[0:5] 31 | [6:8] BF | [9] / | [10] 0 | [11:15] RA | [16:20] RB | [21:30] 32 | [31] /


cmplw BF, RA, RB


$CR_{BF} \leftarrow (RA) < (RB) \| (RA) > (RB) \| (RA) = (RB) \| XER_{SO}$


The cmplw instruction performs an unsigned word compare of the contents of RA and RB.
This is an extended mnemonic for the compare logical (cmpl) instruction:


cmpl BF, 0, RA, RB


cmplwi 
compare logical (unsigned) word immediate 


cmplwi BF, RA, ui


$CR_{BF} \leftarrow (RA) < ui \| (RA) > ui \| (RA) = ui \| XER_{SO}$


The cmplwi instruction performs an unsigned word compare of the contents of RA to the value 
ui. This is an extended mnemonic for the compare logical immediate (cmpli) instruction: 


cmpli BF, 0, RA, ui 
cntlzd[.] 


count leading zeros doubleword 


cntlzd[.] RT, RA


The cntlzd[.] instruction counts the number of leading zeros in RA and places the count into
RT. The count ranges from 0 to 64 inclusive. This instruction is defined only on 64-bit
implementations.


If the record bit is set (cntlzd.) then CR0 is updated.


cntlzw[.] 
count leading zeros word 


[0:5] 31 | [6:10] RA | [11:15] RT | [16:20] / | [21:30] 26 | [31] Rc


cntlzw[.] RT, RA


The cntlzw[.] instruction counts the number of leading zeros in RA and places the count into 
RT. The count ranges from 0 to 32, inclusive. This instruction is defined on both 32-bit and 
64-bit implementations. 


If the record bit is set (cntlzw.) then CR0 is updated.
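The counts produced by cntlzw and cntlzd can be sketched with Python's `int.bit_length`. The function names mirror the mnemonics purely for illustration.

```python
def cntlzw(ra):
    """Count leading zeros in the low-order 32 bits of ra (result 0..32)."""
    return 32 - (ra & 0xFFFFFFFF).bit_length()

def cntlzd(ra):
    """Count leading zeros in a 64-bit value (result 0..64)."""
    return 64 - (ra & 0xFFFFFFFFFFFFFFFF).bit_length()
```

A zero operand gives the maximum count (32 or 64), which is why the ranges quoted above are inclusive at both ends.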


crand 
condition register AND 


[0:5] 19 | [6:10] BT | [11:15] BA | [16:20] BB | [21:30] 257 | [31] /


crand BT, BA, BB


$CR_{BT} \leftarrow CR_{BA} \land CR_{BB}$


The crand instruction performs a logical AND of CR bit BA (CRBA) and CR bit BB (CRBB) 
and places the result into CR bit BT (CRBT). 


crandc 
condition register AND with complement


[0:5] 19 | [6:10] BT | [11:15] BA | [16:20] BB | [21:30] 129 | [31] /


crandc BT, BA, BB


$CR_{BT} \leftarrow CR_{BA} \land \lnot CR_{BB}$


The crandc instruction performs a logical AND of condition register bit BA (CRBA) and the 
logical negation of condition register bit BB (CRBB) and places the result into condition reg- 
ister bit BT (CRBT). 


crclr 
condition register clear 


[0:5] 19 | [6:10] BT | [11:15] BT | [16:20] BT | [21:30] 193 | [31] /


crclr BT


$CR_{BT} \leftarrow 0$


The crclr instruction clears condition register bit BT (CRBT). This is an extended mnemonic
for the condition register XOR (crxor) instruction: 


crxor BT, BT, BT 


creqv 
condition register equivalent (XNOR)


creqv BT, BA, BB


$CR_{BT} \leftarrow CR_{BA} \equiv CR_{BB}$


The creqv instruction performs a logical equivalence (XNOR) of condition register bit BA
(CRBA) and condition register bit BB (CRBB) and places the result into condition register bit 
BT (CRBT). 


crmove 
condition register move 


crmove BT, BA


$CR_{BT} \leftarrow CR_{BA}$


The crmove instruction copies condition register bit BA (CRBA) into condition register bit 
BT (CRBT). This is an extended mnemonic for the condition register OR (cror) instruction: 


cror BT, BA, BA 


crnand 

condition register NAND 
crnand BT, BA, BB


$CR_{BT} \leftarrow \lnot (CR_{BA} \land CR_{BB})$


The crnand instruction performs a logical NAND of condition register bit BA (CRBA) and 
condition register bit BB (CRBB) and places the result into condition register bit BT (CRBT). 


crnor 
condition register NOR 
[0:5] 19 | [6:10] BT | [11:15] BA | [16:20] BB | [21:30] 33 | [31] /


crnor BT, BA, BB

$CR_{BT} \leftarrow \lnot (CR_{BA} \lor CR_{BB})$


The crnor instruction performs a logical NOR of condition register bit BA (CRBA) and con- 
dition register bit BB (CRBB) and places the result into condition register bit BT (CRBT). 


crnot 
condition register not 


[0:5] 19 | [6:10] BT | [11:15] BA | [16:20] BA | [21:30] 33 | [31] /


crnot BT, BA


$CR_{BT} \leftarrow \lnot CR_{BA}$


The crnot instruction performs a logical negation of condition register bit BA (CRBA) and 
places the result into condition register bit BT (CRBT). This is an extended mnemonic for the 
condition register NOR (crnor) instruction: 


crnor BT, BA, BA 


cror 
condition register OR 


[0:5] 19 | [6:10] BT | [11:15] BA | [16:20] BB | [21:30] 449 | [31] /

cror BT, BA, BB


$CR_{BT} \leftarrow CR_{BA} \lor CR_{BB}$


The cror instruction performs a logical OR of condition register bit BA (CRBA) and condi- 
tion register bit BB (CRBB) and places the result into condition register bit BT (CRBT). 


crorc 
condition register OR with complement


[0:5] 19 | [6:10] BT | [11:15] BA | [16:20] BB | [21:30] 417 | [31] /


crorc BT, BA, BB


$CR_{BT} \leftarrow CR_{BA} \lor \lnot CR_{BB}$


The crorc instruction performs a logical OR of condition register bit BA (CRBA) and the
logical negation of condition register bit BB (CRBB) and places the result into condition
register bit BT (CRBT).


crset 
condition register set 


crset BT


$CR_{BT} \leftarrow 1$


The crset instruction sets condition register bit BT (CRBT) to one. This is an extended
mnemonic for the condition register equivalence (creqv) instruction:


creqv BT, BT, BT 

crxor 

condition register XOR 
[0:5] 19 | [6:10] BT | [11:15] BA | [16:20] BB | [21:30] 193 | [31] /

crxor BT, BA, BB


$CR_{BT} \leftarrow CR_{BA} \oplus CR_{BB}$


The crxor instruction performs a logical XOR of condition register bit BA (CRBA) and condi- 
tion register bit BB (CRBB) and places the result into condition register bit BT (CRBT). 
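The condition register logical instructions, and the extended mnemonics built on them, can be modelled on a 32-bit integer standing in for the CR. This is a sketch only: the helper names are hypothetical, and bit 0 is taken to be the most significant bit.

```python
def get_bit(cr, i):
    """Read CR bit i, where bit 0 is the most significant of 32 bits."""
    return (cr >> (31 - i)) & 1

def set_bit(cr, i, value):
    """Return cr with bit i replaced by the low bit of value."""
    cleared = cr & ~(1 << (31 - i)) & 0xFFFFFFFF
    return cleared | ((value & 1) << (31 - i))

def cr_logical(op):
    """Build a three-operand CR instruction from a one-bit function."""
    def instruction(cr, bt, ba, bb):
        return set_bit(cr, bt, op(get_bit(cr, ba), get_bit(cr, bb)))
    return instruction

crand = cr_logical(lambda a, b: a & b)
cror = cr_logical(lambda a, b: a | b)
crxor = cr_logical(lambda a, b: a ^ b)
crnor = cr_logical(lambda a, b: ~(a | b))
creqv = cr_logical(lambda a, b: ~(a ^ b))

# Extended mnemonics, expressed exactly as the entries above define them.
def crclr(cr, bt):
    return crxor(cr, bt, bt, bt)

def crset(cr, bt):
    return creqv(cr, bt, bt, bt)

def crmove(cr, bt, ba):
    return cror(cr, bt, ba, ba)

def crnot(cr, bt, ba):
    return crnor(cr, bt, ba, ba)
```

XORing a bit with itself always gives 0 and XNORing it with itself always gives 1, which is why crclr and crset need no separate encodings.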


dcbf 

data cache block flush 
dcbf RA, RB 


The dcbf instruction flushes a cache block into memory. If RA is 0 (r0), then the contents of 
RB are used as the effective address, otherwise, the sum of the contents of RA and the contents 
of RB is used as the effective address. The cache block addressed by this effective address is 
flushed from the data cache. 


The exact action taken depends on whether memory coherence is enforced for that address 
(see Virtual Memory), and on the state of that address in the cache. 
1. Coherence required
   unmodified block: Invalidate all copies of the block in the data caches of all processors.
   modified block: Store the block back to main memory (from whichever caches have
   the modified data) then invalidate all copies of the block in the data caches of all
   processors.
   absent block: If the block is absent from the cache, then store back any modified
   copies of the block in other caches (if there are any), and then invalidate all copies
   of the block in the data caches of all processors.

2. Coherence not required
   unmodified block: Invalidate all copies of the block in the data caches of this processor.
   modified block: Store the block back to main memory (from whichever caches in this
   processor have the modified data) then invalidate all copies of the block in the data
   caches of this processor.
   absent block: Do nothing.
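The cases listed above amount to a small decision table, which can be written out in Python. The function name and the action strings are purely illustrative; the sketch describes the sequence of actions and does not model a real cache.

```python
def dcbf_actions(coherence_required, block_state):
    """List the actions dcbf takes for one cache block.

    block_state is 'unmodified', 'modified' or 'absent', as seen by the
    cache of the processor executing the instruction.
    """
    if block_state == "absent" and not coherence_required:
        return []                                   # nothing to do
    scope = "all processors" if coherence_required else "this processor"
    actions = []
    if block_state == "modified":
        actions.append("store block to main memory")
    elif block_state == "absent":
        actions.append("store back modified copies held by other caches")
    actions.append(f"invalidate copies in data caches of {scope}")
    return actions
```

The same table structure applies to dcbi and dcbst, with the store and invalidate steps changed as their entries describe.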


dcbi 
data cache block invalidate 


[0:5] 31 | [6:10] / | [11:15] RA | [16:20] RB | [21:30] 470 | [31] /


dcbi RA, RB 


The dcbi instruction invalidates a cache block. If RA is 0 (r0), then the contents of RB are used 
as the effective address, otherwise, the sum of the contents of RA and the contents of RB is 
used as the effective address. The cache block addressed by this effective address is invalidated 
from the data cache. 


The exact action taken depends on whether memory coherence is enforced for that address 
(see Virtual Memory), and on the state of that address in the cache. 
1. Coherence required
   unmodified block: Invalidate all copies of the block in the data caches of all processors.
   modified block: Invalidate all copies of the block in the data caches of all processors
   (this discards the modified data).
   absent block: If the block is absent from the cache, then invalidate all copies of the
   block in the data caches of all processors (possibly discarding modified data).

2. Coherence not required
   unmodified block: Invalidate all copies of the block in the data caches of this processor.
   modified block: Invalidate all copies of the block in the data caches of this processor
   (discard the modified contents).
   absent block: Do nothing.

This instruction is privileged (see Appendix C).


dcbst 

data cache block store 
dcbst RA, RB


The dcbst instruction stores a cache block into memory. If RA is 0 (r0), then the contents of
RB are used as the effective address, otherwise, the sum of the contents of RA and the contents
of RB is used as the effective address. The data cache block addressed by this effective address
is stored into main memory.


The exact action taken depends on whether memory coherence is enforced for that address 
(see Virtual Memory), and on the state of that address in the cache. 
1. Coherence required 


unmodified block: If any other processor has a modified copy of the cache block, then 
store it back to memory. 


modified block: Store the block back to main memory. If any other processor has a 
modified copy of the cache block, then store it back to memory. 


absent block: If the block is absent from the cache, then store back any modified 
copies of the block in other caches (if there are any). 


2. Coherence not required 
unmodified block: Do nothing. 
modified block: Store the block back to main memory. 
absent block: Do nothing. 


dcbt 

data cache block touch 
dcbt RA, RB


The dcbt instruction is a hint to the processor that an access to the address specified by this
instruction will occur in the near future. If RA is 0 (r0), then the contents of RB are used as the
effective address, otherwise, the sum of the contents of RA and the contents of RB is used as the
effective address.


This instruction can be used to reduce the cache miss penalty on predictable accesses in a pro- 
gram (typically, a processor will treat this instruction as a byte load without a register target). 
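The effective address rule shared by the cache instructions (and by eciwx and ecowx) can be sketched as follows. The register file is modelled as a Python list, and the function name and default width are assumptions made for the illustration.

```python
def effective_address(gpr, ra, rb, bits=32):
    """Compute the effective address used by dcbt and the other RA, RB forms.

    gpr is a list of 32 register values; ra and rb are register numbers.
    When ra is 0, the base is the constant 0 rather than the contents of r0.
    """
    base = 0 if ra == 0 else gpr[ra]
    return (base + gpr[rb]) & ((1 << bits) - 1)

gpr = [0] * 32
gpr[0] = 0x1234      # ignored when used in the RA position
gpr[3] = 0x1000
gpr[4] = 0x20
```

Here `effective_address(gpr, 3, 4)` is 0x1020, while `effective_address(gpr, 0, 4)` is just 0x20 because r0 in the RA position stands for zero.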


dcbtst 
data cache block touch for store 


[0:5] 31 | [6:10] / | [11:15] RA | [16:20] RB | [21:30] 246 | [31] /


dcbtst RA, RB


The dcbtst instruction is a hint to the processor that a store access to the address specified by
this instruction will occur in the near future. If RA is 0 (r0), then the contents of RB are used as
the effective address, otherwise, the sum of the contents of RA and the contents of RB is used as
the effective address.


This instruction can be used to reduce the cache miss penalty on predictable store accesses in 
a program (typically, a processor will treat this instruction as a byte load without a register tar- 
get but will initiate a read-with-intent-to-modify operation). 


dcbz 
data cache block zero 


[0:5] 31 | [6:10] / | [11:15] RA | [16:20] RB | [21:30] 1014 | [31] /


dcbz RA, RB 


The dcbz instruction sets a cache block to 0. If RA is 0 (r0), then the contents of RB are used 
as the effective address, otherwise, the sum of the contents of RA and the contents of RB is 
used as the effective address. The data cache block addressed by this instruction is set to 0. 


If the block to be zeroed is not in the cache, but caching is allowed, then it is established in the
cache (in a modified state) without fetching from main memory. If the address specified by the
instruction is cache inhibited or write-through required, then either all bytes corresponding to
that cache block are set to 0 in main memory, or the system alignment interrupt handler is
initiated. Note that this instruction discards modified data in this cache (or any other processor's
cache, if coherent).


divd[o][.]
integer divide doubleword 


[0:5] 31 | [6:10] RT | [11:15] RA | [16:20] RB | [21] OE | [22:30] 489 | [31] Rc


divd[o][.] RT, RA, RB


$RT \leftarrow (RA) \div (RB)$


The divd[o][.] instruction divides RA by RB and stores the result into RT. Both the operands 
and the quotient are signed doubleword integers. Division by 0 (contents of RB are 0) or divi- 
sion of the maximum negative integer by -1 leaves RT undefined. 


If the divdo[.] form of the instruction is used, then the overflow and summary overflow bits of 
the XER are set if overflow occurs. Division by 0 or division of the maximum negative integer 
by -1 sets the overflow bit. 


If the divd[o]. form of the instruction is used, then CR0 is updated. With this form, division
by 0 (contents of RB are 0) or division of the maximum negative integer by -1 leaves the LT,
GT, and EQ bits of CR0 undefined.


This instruction is only available on 64-bit implementations. 


divdu[o][.]
unsigned integer divide doubleword


[0:5] 31 | [6:10] RT | [11:15] RA | [16:20] RB | [21] OE | [22:30] 457 | [31] Rc


divdu[o][.] RT, RA, RB


$RT \leftarrow (RA) \div (RB)$


The divdu[o][.] instruction divides RA by RB and stores the result into RT. Both the operands
and the quotient are unsigned doubleword integers. Division by 0 (contents of RB are 0) leaves
RT undefined.


If the divduo[.] form of the instruction is used, then the overflow and summary overflow bits 
of the XER are set if overflow occurs. Division by 0 sets the overflow bit. 


If the divdu[o]. form of the instruction is used, then CR0 is updated. With this form, division
by 0 (contents of RB are 0) leaves the LT, GT, and EQ bits of CR0 undefined.


This instruction is only available on 64-bit implementations. 


divw[o][.]
integer divide word


[0:5] 31 | [6:10] RT | [11:15] RA | [16:20] RB | [21] OE | [22:30] 491 | [31] Rc


divw[o][.] RT, RA, RB


$RT \leftarrow (RA) \div (RB)$


The divw[o][.] instruction divides RA by RB and stores the result into RT. Both the operands 
and the quotient are signed word integers. Division by 0 (contents of RB are 0) or division of 
the maximum negative integer by -1 leaves RT undefined. 


If the divwo[.] form of the instruction is used, then the overflow and summary overflow bits of 
the XER are set if overflow occurs. Division by 0 or division of the maximum negative integer 
by -1 sets the overflow bit. 


If the divw[o]. form of the instruction is used, then CR0 is updated. With this form, division
by 0 (contents of RB are 0) or division of the maximum negative integer by -1 leaves the LT,
GT, and EQ bits of CR0 undefined.
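The two undefined cases of divw can be made explicit in a short Python model. This is a sketch under stated assumptions: the function names are hypothetical, an undefined RT is represented by `None`, and the quotient is truncated toward zero, which is the PowerPC behaviour but is not spelled out in the text above.

```python
INT32_MIN = -(1 << 31)

def to_signed32(value):
    """Reinterpret the low 32 bits of value as a two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value

def divw(ra, rb):
    """Return (rt, overflow) for a signed word divide.

    rt is None when the architecture leaves RT undefined; overflow is the
    value the divwo form would record in the XER overflow bit.
    """
    a, b = to_signed32(ra), to_signed32(rb)
    if b == 0 or (a == INT32_MIN and b == -1):
        return None, True
    quotient = abs(a) // abs(b)          # truncate toward zero (assumption)
    if (a < 0) != (b < 0):
        quotient = -quotient
    return quotient & 0xFFFFFFFF, False
```

Because the result is undefined in the overflow cases, code that cannot rule them out typically tests the divisor (and the -1 case) before issuing the divide.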


divwu[o][.] 


unsigned integer divide word 


[0:5] 31 | [6:10] RT | [11:15] RA | [16:20] RB | [21] OE | [22:30] 459 | [31] Rc


divwu[o][.] RT, RA, RB


$RT \leftarrow (RA) \div (RB)$


The divwu[o][.] instruction divides RA by RB and stores the result into RT. Both the operands
and the quotient are unsigned word integers. Division by 0 (contents of RB are 0) leaves
RT undefined.


If the divwuo[.] form of the instruction is used, then the overflow and summary overflow bits 
of the XER are set if overflow occurs. Division by 0 sets the overflow bit. 


If the divwu[o]. form of the instruction is used, then CR0 is updated. With this form, division
by 0 (contents of RB are 0) leaves the LT, GT, and EQ bits of CR0 undefined.


eciwx 
external control word in 


[0:5] 31 | [6:10] RT | [11:15] RA | [16:20] RB | [21:30] 310 | [31] /


eciwx RT, RA, RB


The eciwx instruction translates the effective address then sends a load word request for this
address to the device specified by the RID field in the external access register (EAR) (see
Appendix C). If RA is 0 (r0), then the contents of RB are used as the effective address, otherwise,
the sum of the contents of RA and the contents of RB is used as the effective address.


This instruction is used to send a real address to a device that does not support translation. A 
status word is received from the device and placed into RT. The cache is always bypassed by 
this instruction. 


ecowx 
external control word out 


[0:5] 31 | [6:10] RT | [11:15] RA | [16:20] RB | [21:30] 438 | [31] /


ecowx RT, RA, RB 


The ecowx instruction translates the effective address then sends a store word request for this 
address to the device specified by the RID field in the external access register (EAR) (see Ap- 
pendix C). If RA is 0 (r0), then the contents of RB are used as the effective address; otherwise, 
the sum of the contents of RA and the contents of RB is used as the effective address. 


This instruction is used to send a real address along with a control word (the contents of
register RT) to a device (which does not support translation). The cache is always bypassed by this
instruction.

eieio 

enforce in-order execution of I/O 




eieio 


The eieio instruction forms a fence between certain storage operations. Operations that occur 
before the eieio instruction are completed with respect to main storage before operations within 
the same group that follow the eieio are initiated. Operations are split into two groups and 
ordering only occurs within each group. 


The first group includes loads and stores to storage that are both cache inhibited and guarded, 
and stores to storage that are write-through required (see Virtual Memory in Appendix C). 


The second group includes stores to storage that are not cache inhibited, not write-through 
required, and memory coherent. 


No ordering is forced between accesses from different groups. For stronger ordering, see the 
sync instruction. 


eqv[.]
logical equivalent (XNOR)


[0:5] 31 | [6:10] RA | [11:15] RT | [16:20] RB | [21:30] 284 | [31] Rc


eqv[.] RT, RA, RB


$RT \leftarrow (RA) \equiv (RB)$


The eqv[.] instruction performs a bitwise logical equivalence between the contents of RA and
the contents of RB, placing the result into RT.


If the record bit is set (eqv.), then CR0 is updated.
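Bitwise equivalence is the complement of exclusive OR, as the following Python sketch shows. The function name and the 32-bit default width are assumptions for the illustration.

```python
def eqv(ra, rb, bits=32):
    """Bitwise XNOR: a result bit is 1 where the operand bits are equal."""
    mask = (1 << bits) - 1
    return ~(ra ^ rb) & mask
```

One consequence is that `eqv(x, x)` is all ones for any x, which is a handy way to load -1 into a register.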


extldi[.]
extract and left justify doubleword immediate


extldi[.] RT, RA, n, b (n > 0)


$RT \leftarrow (RA)_{b:b+n-1} \| {}^{64-n}0$


The extldi[.] instruction extracts n bits from RA starting at bit position b, left justifies them,
and places the result into RT (remaining bits in RT are cleared).


If the Rc bit is set (extldi.), then CR0 is updated.


This is an extended mnemonic for the rotate left doubleword immediate then clear right 
(rldicr[.]) instruction: 


rldicr[.] RT, RA, b, n-1 
This instruction is defined only for 64-bit implementations. 


extlwi[.] 
extract and left justify word immediate 


[0:5] 21 | [6:10] RA | [11:15] RT | [16:20] b | [21:25] 0 | [26:30] n-1 | [31] Rc

extlwi[.] RT, RA, n, b (n > 0)


$RT \leftarrow (RA)_{b:b+n-1} \| {}^{32-n}0$


The extlwi[.] instruction extracts n bits from RA starting at bit position b, left justifies them,
and places the result into RT (remaining bits in RT are cleared).


If the Rc bit is set (extlwi.), then CR0 is updated.


This is an extended mnemonic for the rotate left word immediate then AND with mask
(rlwinm[.]) instruction: 


rlwinm[.] RT, RA, b, 0, n-1 


extrdi[.] 
extract and right justify doubleword immediate 


extrdi[.] RT, RA, n, b (n > 0)


$RT \leftarrow {}^{64-n}0 \| (RA)_{b:b+n-1}$


The extrdi[.] instruction extracts n bits from RA starting at bit position b, right justifies them,
and places the result into RT (remaining bits in RT are cleared).


If the Rc bit is set (extrdi.) then CR0 is updated.


This is an extended mnemonic for the rotate left doubleword immediate then clear left
(rldicl[.]) instruction:


rldicl[.] RT, RA, b+n, 64-n
This instruction is defined only for 64-bit implementations. 


extrwi[.]
extract and right justify word immediate


[0:5] 21 | [6:10] RA | [11:15] RT | [16:20] b+n | [21:25] 32-n | [26:30] 31 | [31] Rc


extrwi[.] RT, RA, n, b (n > 0)


$RT \leftarrow {}^{32-n}0 \| (RA)_{b:b+n-1}$


The extrwi[.] instruction extracts n bits from RA starting at bit position b, right justifies them,
and places the result into RT (remaining bits in RT are cleared).


If the Rc bit is set (extrwi.) then CR0 is updated.


This is an extended mnemonic for the rotate left word immediate then AND with mask 
(rlwinm[.]) instruction: 


rlwinm[.] RT, RA, b+n, 32-n, 31 
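The field extraction done by extlwi and extrwi can be illustrated directly with shifts and masks. This sketch assumes 32-bit values, numbers bits from 0 at the most significant end as the formulas above do, and assumes the field does not run past bit 31 (b + n ≤ 32); the function names mirror the mnemonics for convenience only.

```python
def extract_field(ra, n, b):
    """Return the n-bit field of ra that starts at bit b (bit 0 is the MSB)."""
    return (ra >> (32 - b - n)) & ((1 << n) - 1)

def extlwi(ra, n, b):
    """Extract n bits starting at bit b and left justify them."""
    return extract_field(ra, n, b) << (32 - n)

def extrwi(ra, n, b):
    """Extract n bits starting at bit b and right justify them."""
    return extract_field(ra, n, b)
```

With ra = 0x12345678, the eight bits starting at bit 8 are 0x34, so extlwi gives 0x34000000 and extrwi gives 0x00000034.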


extsb[.] 
sign-extend byte 


[0:5] 31 | [6:10] RA | [11:15] RT | [16:20] / | [21:30] 954 | [31] Rc


extsb[.] RT, RA


$RT \leftarrow {}^{24}(RA)_{24} \| (RA)_{24:31}$


The extsb[.] instruction sign extends the least significant byte of RA then stores the result
into RT.


If the Rc bit is set (extsb.), then CR0 is updated. Note that on 64-bit implementations,
$RT \leftarrow {}^{56}(RA)_{56} \| (RA)_{56:63}$.


extsh[.] 
sign-extend half-word 


[0:5] 31 | [6:10] RA | [11:15] RT | [16:20] / | [21:30] 922 | [31] Rc


extsh[.] RT, RA


$RT \leftarrow {}^{16}(RA)_{16} \| (RA)_{16:31}$


The extsh[.] instruction sign extends the least significant half-word of RA then stores the result
into RT.


If the Rc bit is set (extsh.), then CR0 is updated. Note that on 64-bit implementations,
$RT \leftarrow {}^{48}(RA)_{48} \| (RA)_{48:63}$.
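Sign extension replicates the top bit of the narrow value across the upper bits of the result. A minimal Python sketch, with a hypothetical helper and a register width parameter that defaults to 32 bits:

```python
def sign_extend(ra, from_bits, to_bits=32):
    """Sign extend the low from_bits bits of ra to a to_bits-wide value."""
    value = ra & ((1 << from_bits) - 1)
    if value >> (from_bits - 1):                 # sign bit of the narrow value
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value

def extsb(ra, to_bits=32):
    return sign_extend(ra, 8, to_bits)

def extsh(ra, to_bits=32):
    return sign_extend(ra, 16, to_bits)
```

Bits of RA above the byte or half-word are ignored, so `extsb(0x17F)` is simply 0x7F.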


extsw[.]
sign-extend word


[0:5] 31 | [6:10] RA | [11:15] RT | [16:20] / | [21:30] 986 | [31] Rc


extsw[.] RT, RA

$RT \leftarrow {}^{32}(RA)_{32} \| (RA)_{32:63}$


The extsw[.] instruction sign extends the least significant word of RA then stores the result
into RT.


If the Rc bit is set (extsw.), then CR0 is updated.


This instruction is defined only on 64-bit implementations. 
fabs[.]
floating-point absolute value


[0:5] 63 | [6:10] FRT | [11:15] / | [16:20] FRA | [21:30] 264 | [31] Rc


fabs[.] FRT, FRA


$FRT \leftarrow |(FRA)|$


The fabs[.] instruction calculates the absolute value of the contents of floating-point register
FRA and stores this into floating-point register FRT.


If the Rc bit is set (fabs.), then CR1 is updated. 


fadd[.] 
floating-point add double precision 


fadd[.] FRT, FRA, FRB 


FRT <— (FRA) + (FRB) 


The fadd[.] instruction performs a double precision add of the contents of floating-point reg- 
isters FRA and FRB, and stores this into floating-point register FRT. 


If the Rc bit is set (fadd.), then CR1 is updated. 
The following FPSCR fields may be updated by this instruction (see Appendix D): 
FPRF, FR, FI, FX, OX, UX, XX, VXSNAN, and VXISI 


fadds[.] 
floating-point add single precision 


fadds[.] FRT, FRA, FRB 


FRT — (FRA) + (FRB) 


The fadds[.] instruction performs a double precision add of the contents of floating-point
registers FRA and FRB, rounds this to single precision, and then stores it into floating-point
register FRT.


If the Rc bit is set (fadds.), then CR1 is updated.

The following FPSCR fields may be updated by this instruction (see Appendix D):
FPRF, FR, FI, FX, OX, UX, XX, VXSNAN, and VXISI

fcfid[.] 


convert integer doubleword to floating-point 


[0:5] 63 | [6:10] FRT | [11:15] / | [16:20] FRA | [21:30] 846 | [31] Rc


fcfid[.] FRT, FRA 


The fcfid[.] instruction converts the contents of floating-point register FRA from a doubleword 
integer into a double precision floating-point value and stores this into floating-point register 
FRT. 


If the Rc bit is set (fcfid.), then CR1 is updated. 

The following FPSCR fields may be updated by this instruction (see Appendix D): 
FPRF, FR, FI, FX, and XX 

This instruction is available only on 64-bit implementations. 


fcmpo 
floating-point compare ordered 


fcmpo bf, FRA, FRB


$CR_{bf} \leftarrow (FRA) < (FRB) \| (FRA) > (FRB) \| (FRA) = (FRB) \| \text{unordered}$


The fcmpo instruction compares the contents of floating-point registers FRA and FRB and
stores the result into condition register field bf (CRbf). Unordered (bit three of the CR field)
is set if either of the operands is not a number (NaN).

The following FPSCR fields may be updated by this instruction (see Appendix D): 
FPCC, FX, VXSNAN, and VXVC 


fcmpu
floating-point compare unordered


[0:5] 63 | [6:8] bf | [9:10] / | [11:15] FRA | [16:20] FRB | [21:30] 0 | [31] /

fcmpu bf, FRA, FRB


$CR_{bf} \leftarrow (FRA) < (FRB) \| (FRA) > (FRB) \| (FRA) = (FRB) \| \text{unordered}$


The fcmpu instruction compares the contents of floating-point registers FRA and FRB and
stores the result into condition register field bf (CRbf). Unordered (bit three of the CR field) 
is set if either of the operands is not a number (NaN). 


The following FPSCR fields may be updated by this instruction (see Appendix D): 
FPCC, FX, and VXSNAN 
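The CR field produced by a floating-point compare can be sketched in Python, whose `float` type is an IEEE double. The function name is hypothetical, and the FPSCR side effects listed above are not modelled.

```python
import math

def fcmpu_field(fra, frb):
    """Return the (LT, GT, EQ, UN) bits written to CR field bf."""
    if math.isnan(fra) or math.isnan(frb):
        return (0, 0, 0, 1)          # unordered: either operand is a NaN
    return (int(fra < frb), int(fra > frb), int(fra == frb), 0)
```

Exactly one of the four bits is set for any pair of operands, which is what allows the bun and bnu branch mnemonics to test the unordered case on its own.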
fctid[.]


convert floating-point to integer doubleword


[0:5] 63 | [6:10] FRT | [11:15] / | [16:20] FRA | [21:30] 814 | [31] Rc


fctid[.] FRT, FRA


The fctid[.] instruction converts the contents of floating-point register FRA from a double
precision floating-point number into a signed doubleword integer value and stores this into
floating-point register FRT.


If the Rc bit is set (fctid.) then CR1 is updated.

The following FPSCR fields may be updated by this instruction (see Appendix D): 
FPRF (undefined), FR, FI, FX, XX, VXSNAN, and VXCVI 

This instruction is available only on 64-bit implementations. 


fctidz[.] 


convert floating-point to integer doubleword with round toward zero 


bits 0-5: 63 | 6-10: FRT | 11-15: /// | 16-20: FRA | 21-30: 815 | 31: Rc


fctidz[.] FRT, FRA 


The fctidz[.] instruction converts the contents of floating-point register FRA from a double
precision floating-point number into a signed doubleword integer value and stores this into
floating-point register FRT. The rounding mode used in this instruction is round toward zero,
regardless of the setting in the FPSCR (see Appendix D).

If the Rc bit is set (fctidz.) then CR1 is updated. 

The following FPSCR fields may be updated by this instruction (see Appendix D): 
FPRF (undefined), FR, FI, FX, XX, VXSNAN, and VXCVI 


This instruction is available only on 64-bit implementations.

fctiw[.]
convert floating-point to integer word

bits 0-5: 63 | 6-10: FRT | 11-15: /// | 16-20: FRA | 21-30: 14 | 31: Rc

fctiw[.] FRT, FRA


The fctiw[.] instruction converts the contents of floating-point register FRA from a double
precision floating-point number into a signed word integer value and stores this into the
low-order 32 bits of floating-point register FRT.

If the Rc bit is set (fctiw.), then CR1 is updated.

The following FPSCR fields may be updated by this instruction (see Appendix D):
FPRF (undefined), FR, FI, FX, XX, VXSNAN, and VXCVI


fctiwz[.]
convert floating-point to integer word with round toward zero

bits 0-5: 63 | 6-10: FRT | 11-15: /// | 16-20: FRA | 21-30: 15 | 31: Rc

fctiwz[.] FRT, FRA


The fctiwz[.] instruction converts the contents of floating-point register FRA from a double
precision floating-point number into a signed word integer value and stores this into the
low-order 32 bits of floating-point register FRT. The rounding mode used in this instruction is
round toward 0, regardless of the setting in the FPSCR (see Appendix D).

If the Rc bit is set (fctiwz.), then CR1 is updated.

The following FPSCR fields may be updated by this instruction (see Appendix D):
FPRF (undefined), FR, FI, FX, XX, VXSNAN, and VXCVI
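
As a rough illustration of the round-toward-zero conversion, the Python sketch below models the 32-bit pattern left in the low-order word of FRT. The name fctiwz_model is hypothetical, and the handling of NaNs and out-of-range values (saturating to the largest or smallest word integer) is an assumption taken from the architecture definition rather than from the description above; Appendix D remains the reference for the exact behavior.

```python
import math

INT32_MAX_BITS = 0x7FFFFFFF   # largest signed word integer
INT32_MIN_BITS = 0x80000000   # bit pattern of -2**31

def fctiwz_model(fra: float) -> int:
    """Return the 32-bit result pattern of a round-toward-zero conversion."""
    if math.isnan(fra):
        return INT32_MIN_BITS            # assumption: a NaN yields 0x80000000
    if fra >= 2.0 ** 31:
        return INT32_MAX_BITS            # assumption: positive overflow saturates
    if fra <= -(2.0 ** 31):
        return INT32_MIN_BITS            # assumption: negative overflow saturates
    return math.trunc(fra) & 0xFFFFFFFF  # drop the fraction, keep 32 bits
```

Note that truncation moves negative values toward zero as well, so -2.9 converts to -2 (0xFFFFFFFE), not -3.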


fdiv[.]
floating-point divide double precision

fdiv[.] FRT, FRA, FRB

FRT ← (FRA) ÷ (FRB)

The fdiv[.] instruction performs a double precision divide of the contents of floating-point
registers FRA and FRB (FRA divided by FRB), and stores this into floating-point register FRT.

If the Rc bit is set (fdiv.), then CR1 is updated.

The following FPSCR fields may be updated by this instruction (see Appendix D): 
FPRF, FR, FI, FX, OX, UX, ZX, XX, VXSNAN, VXIDI, and VXZDZ 

fdivs[.] 


floating-point divide single precision 


fdivs[.] FRT, FRA, FRB 


FRT ← (FRA) ÷ (FRB)


The fdivs[.] instruction performs a double precision divide of the contents of floating-point 
registers FRA and FRB (FRA divided by FRB), rounds this to single precision, and stores the 
result into floating-point register FRT. 


If the Rc bit is set (fdivs.) then CR1 is updated. 
The following FPSCR fields may be updated by this instruction (see Appendix D): 
FPRF, FR, FI, FX, OX, UX, ZX, XX, VXSNAN, VXIDI, and VXZDZ 


fmadd[.]
floating-point multiply accumulate double precision

fmadd[.] FRT, FRA, FRC, FRB

FRT ← ( (FRA) × (FRC) ) + (FRB)

The fmadd[.] instruction performs a double precision multiply of the contents of floating-point
registers FRA and FRC, adds to this the contents of floating-point register FRB, and stores this
into floating-point register FRT. The intermediate result of FRA times FRC is not rounded
before the addition (see Appendix D).


If the Rc bit is set (fmadd.) then CR1 is updated. 

The following FPSCR fields may be updated by this instruction (see Appendix D): 
FPRF, FR, FI, FX, OX, UX, XX, VXSNAN, VXISI, and VXIMZ 
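
The effect of not rounding the intermediate product can be seen with a small experiment. The sketch below is a Python model, not PowerPC code: fmadd_model and unfused are illustrative names, exact rational arithmetic stands in for the unrounded intermediate, and only finite operands in round-to-nearest mode are considered.

```python
from fractions import Fraction

def fmadd_model(fra: float, frc: float, frb: float) -> float:
    """(FRA x FRC) + FRB with a single rounding at the very end."""
    exact = Fraction(fra) * Fraction(frc) + Fraction(frb)  # no rounding yet
    return float(exact)                                    # one rounding to double

def unfused(fra: float, frc: float, frb: float) -> float:
    """Separate multiply and add: the product is rounded before the add."""
    return fra * frc + frb

a = 1.0 + 2.0 ** -30
c = 1.0 - 2.0 ** -30
b = -1.0
fused_result = fmadd_model(a, c, b)    # keeps the tiny term: -(2 ** -60)
unfused_result = unfused(a, c, b)      # product rounds to 1.0, so the sum is 0.0
```

Here the exact product is 1 - 2^-60, which cannot be held in a double. Rounding it first loses the small term entirely, while the fused form preserves it.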

fmadds[.]
floating-point multiply accumulate single precision

fmadds[.] FRT, FRA, FRC, FRB

FRT ← ( (FRA) × (FRC) ) + (FRB)

The fmadds[.] instruction performs a double precision multiply of the contents of floating-point
registers FRA and FRC, and adds this to the contents of floating-point register FRB.
This result is rounded to single precision and then stored into floating-point register FRT. The
intermediate result of FRA times FRC is not rounded before the addition (see Appendix D).


If the Rc bit is set (fmadds.) then CR1 is updated. 

The following FPSCR fields may be updated by this instruction (see Appendix D): 
FPRF, FR, FI, FX, OX, UX, XX, VXSNAN, VXISI, and VXIMZ

fmr[.] 


floating-point move register 


bits 0-5: 63 | 6-10: FRT | 11-15: /// | 16-20: FRA | 21-30: 72 | 31: Rc


fmr[.] FRT, FRA 


FRT ← (FRA)

The fmr[.] instruction moves the contents of floating-point register FRA into floating-point
register FRT.

If the Rc bit is set (fmr.), then CR1 is updated.


fmsub[.] 
floating-point multiply subtract double precision 


fmsub[.] FRT, FRA, FRC, FRB 


FRT ← ( (FRA) × (FRC) ) − (FRB)

The fmsub[.] instruction performs a double precision multiply of the contents of floating-point
registers FRA and FRC, subtracts from this the contents of floating-point register FRB, and
stores this into floating-point register FRT. The intermediate result of FRA times FRC is not
rounded before the subtraction (see Appendix D).


If the Rc bit is set (fmsub.), then CR1 is updated. 
The following FPSCR fields may be updated by this instruction (see Appendix D): 
FPRF, FR, FI, FX, OX, UX, XX, VXSNAN, VXISI, and VXIMZ

fmsubs[.]
floating-point multiply subtract single precision

fmsubs[.] FRT, FRA, FRC, FRB


FRT ← ( (FRA) × (FRC) ) − (FRB)

The fmsubs[.] instruction performs a double precision multiply of the contents of floating-point
registers FRA and FRC, and subtracts from this the contents of floating-point register FRB.
This result is then rounded to single precision and stored into floating-point register FRT.
The intermediate result of FRA times FRC is not rounded before the subtraction (see
Appendix D).


If the Rc bit is set (fmsubs.), then CR1 is updated. 


The following FPSCR fields may be updated by this instruction (see Appendix D): 
FPRF, FR, FI, FX, OX, UX, XX, VXSNAN, VXISI, and VXIMZ 


fmul[.]
floating-point multiply double precision

fmul[.] FRT, FRA, FRC

FRT ← (FRA) × (FRC)

The fmul[.] instruction performs a double precision multiply of the contents of floating-point
registers FRA and FRC, then stores this into floating-point register FRT.

If the Rc bit is set (fmul.), then CR1 is updated.

The following FPSCR fields may be updated by this instruction (see Appendix D):
FPRF, FR, FI, FX, OX, UX, XX, VXSNAN, and VXIMZ 


fmuls[.] 
floating-point multiply single precision 


fmuls[.] FRT, FRA, FRC 





FRT ← (FRA) × (FRC)

The fmuls[.] instruction performs a double precision multiply of the contents of floating-point
registers FRA and FRC. This result is then rounded to single precision and then stored into 
floating-point register FRT. 


If the Rc bit is set (fmuls.), then CR1 is updated. 
The following FPSCR fields may be updated by this instruction (see Appendix D): 
FPRF, FR, FI, FX, OX, UX, XX, VXSNAN, and VXIMZ


fnabs[.]
floating-point negative absolute value

bits 0-5: 63 | 6-10: FRT | 11-15: /// | 16-20: FRA | 21-30: 136 | 31: Rc

fnabs[.] FRT, FRA

FRT ← −|(FRA)|

The fnabs[.] instruction calculates the absolute value of the contents of floating-point register
FRA, negates it, then stores it into floating-point register FRT.


If the Rc bit is set (fnabs.), then CR1 is updated. 


fneg[.] 
floating-point negate 


bits 0-5: 63 | 6-10: FRT | 11-15: /// | 16-20: FRA | 21-30: 40 | 31: Rc

fneg[.] FRT, FRA

FRT ← −(FRA)


The fneg[.] instruction negates the contents of floating-point register FRA and stores this into 
floating-point register FRT. 


If the Rc bit is set (fneg.), then CR1 is updated. 
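
Both fneg and fnabs can be understood as operations on the sign bit of the 64-bit register image. The Python sketch below illustrates that view; the helper names are hypothetical, and treating the two instructions as pure sign-bit operations is an assumption based on the architecture definition, not something stated in the descriptions above.

```python
import struct

SIGN_BIT = 1 << 63   # most significant bit of a double precision value

def _to_bits(value: float) -> int:
    """Reinterpret a double as its 64-bit pattern."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]

def _from_bits(bits: int) -> float:
    """Reinterpret a 64-bit pattern as a double."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

def fneg_model(fra: float) -> float:
    return _from_bits(_to_bits(fra) ^ SIGN_BIT)   # flip the sign bit

def fnabs_model(fra: float) -> float:
    return _from_bits(_to_bits(fra) | SIGN_BIT)   # force the sign bit on
```

Because only the sign bit changes, negating zero produces negative zero rather than leaving the value untouched.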


fnmadd[.]
floating-point negate multiply accumulate double precision

fnmadd[.] FRT, FRA, FRC, FRB

FRT ← −( ( (FRA) × (FRC) ) + (FRB) )

The fnmadd[.] instruction performs a double precision multiply of the contents of floating-point
registers FRA and FRC, adds to this the contents of floating-point register FRB, negates
this result, then stores it into floating-point register FRT. The intermediate result of FRA times
FRC is not rounded before the addition (see Appendix D).

If the Rc bit is set (fnmadd.), then CR1 is updated.

The following FPSCR fields may be updated by this instruction (see Appendix D): 
FPRF, FR, FI, FX, OX, UX, XX, VXSNAN, VXISI, and VXIMZ 


fnmadds[.] 
floating-point negate multiply accumulate single precision 


fnmadds[.] FRT, FRA, FRC, FRB

FRT ← −( ( (FRA) × (FRC) ) + (FRB) )

The fnmadds[.] instruction performs a double precision multiply of the contents of floating-point
registers FRA and FRC, and adds this to the contents of floating-point register FRB.
This result is then negated, rounded to single precision, and then stored into floating-point
register FRT. The intermediate result of FRA times FRC is not rounded before the addition
(see Appendix D).

If the Rc bit is set (fnmadds.), then CR1 is updated.
The following FPSCR fields may be updated by this instruction (see Appendix D): 
FPRF, FR, FI, FX, OX, UX, XX, VXSNAN, VXISI, and VXIMZ 


fnmsub[.] 
floating-point negate multiply subtract double precision 


fnmsub[.] FRT, FRA, FRC, FRB

FRT ← −( ( (FRA) × (FRC) ) − (FRB) )

The fnmsub[.] instruction performs a double precision multiply of the contents of floating-point
registers FRA and FRC, subtracts from this the contents of floating-point register FRB,
negates the result, then stores it into floating-point register FRT. The intermediate result of
FRA times FRC is not rounded before the subtraction (see Appendix D).

If the Rc bit is set (fnmsub.), then CR1 is updated.

The following FPSCR fields may be updated by this instruction (see Appendix D):
FPRF, FR, FI, FX, OX, UX, XX, VXSNAN, VXISI, and VXIMZ


fnmsubs[.]
floating-point negate multiply subtract single precision

fnmsubs[.] FRT, FRA, FRC, FRB

FRT ← −( ( (FRA) × (FRC) ) − (FRB) )

The fnmsubs[.] instruction performs a double precision multiply of the contents of floating-point
registers FRA and FRC, and subtracts from this the contents of floating-point register
FRB. This result is then negated, rounded to single precision, and then stored into
floating-point register FRT. The intermediate result of FRA times FRC is not rounded before the
subtraction (see Appendix D).

If the Rc bit is set (fnmsubs.), then CR1 is updated.

The following FPSCR fields may be updated by this instruction (see Appendix D):
FPRF, FR, FI, FX, OX, UX, XX, VXSNAN, VXISI, and VXIMZ

fres[.]
floating-point reciprocal estimate single precision

fres[.] FRT, FRA

FRT ← 1 / (FRA)

The fres[.] instruction estimates the reciprocal of the contents of floating-point register FRA.
This result is then rounded to single precision and stored into floating-point register FRT. The
accuracy of the estimate is as follows:

| (estimate − 1/(FRA)) / (1/(FRA)) | ≤ 1/256

If the Rc bit is set (fres.), then CR1 is updated.
The following FPSCR fields may be updated by this instruction (see Appendix D): 
FPRF, FR (undefined), FI (undefined), FX, OX, UX, ZX, and VXSNAN. 


This instruction is optional and may not be implemented in every processor. Check the Users’ 
Manual of the implementation you are using. 
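
An estimate that is good to one part in 256 is usually refined before it is used. The sketch below shows one Newton-Raphson step in Python; this refinement is a common technique and is offered here as an assumption about typical use, not as something the instruction itself performs. The value named rough is a hypothetical stand-in for an fres result.

```python
def refine_reciprocal(a: float, estimate: float) -> float:
    """One Newton-Raphson step for 1/a: x1 = x0 * (2 - a * x0)."""
    return estimate * (2.0 - a * estimate)

a = 3.0
# Hypothetical stand-in for an fres result: 1/a with a relative error of 1/256.
rough = (1.0 / a) * (1.0 - 1.0 / 256.0)
better = refine_reciprocal(a, rough)

rough_error = abs(rough * a - 1.0)     # roughly 1/256
better_error = abs(better * a - 1.0)   # roughly (1/256) squared
```

Each step approximately squares the relative error, so a small number of multiply-add operations is enough to reach the precision a program needs.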


frsp[.] 


floating-point round to single precision 


bits 0-5: 63 | 6-10: FRT | 11-15: /// | 16-20: FRA | 21-30: 12 | 31: Rc


frsp[.] FRT, FRA 


The frsp[.] instruction converts the contents of floating-point register FRA from a double
precision floating-point number into a single precision floating-point value and stores this into
floating-point register FRT.

If the Rc bit is set (frsp.), then CR1 is updated.
The following FPSCR fields may be updated by this instruction (see Appendix D): 
FPRF, FR, FI, FX, OX, UX, XX, and VXSNAN 
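
The effect of rounding to single precision can be previewed on any machine with the Python sketch below. The name frsp_model is hypothetical; the model assumes round-to-nearest and does not reproduce the FPSCR updates or the overflow behavior of the instruction (Python raises OverflowError for finite values too large for single precision).

```python
import struct

def frsp_model(fra: float) -> float:
    """Round a double to the nearest single precision value, returned as a double."""
    return struct.unpack(">f", struct.pack(">f", fra))[0]
```

For example, frsp_model(0.5) returns 0.5 unchanged because 0.5 is exactly representable in single precision, while frsp_model(0.1) returns a slightly different value from 0.1.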


frsqrte[.] 
floating-point estimate reciprocal square root double precision 


frsqrte[.] FRT, FRA 


FRT ← 1 / √(FRA)

The frsqrte[.] instruction estimates the reciprocal of the square root of the contents of
floating-point register FRA. This result is then stored into floating-point register FRT. The
accuracy of the estimate is as follows:

| (estimate − 1/√(FRA)) / (1/√(FRA)) | ≤ 1/32

If the Rc bit is set (frsqrte.), then CR1 is updated.

The following FPSCR fields may be updated by this instruction (see Appendix D):
FPRF, FR (undefined), FI (undefined), FX, ZX, VXSNAN, and VXSQRT.


This instruction is optional and may not be implemented in every processor. Check the Users’ 
Manual of the implementation you are using. 


fsel[.] 


floating-point select 


fsel[.] FRT, FRA, FRB, FRC

FRT ← ( (FRA) ≥ 0 ) ? (FRB) : (FRC)

The fsel[.] instruction moves the contents of FRB into FRT if the contents of FRA are greater
than or equal to zero; otherwise, it moves the contents of FRC into FRT.


If the Rc bit is set (fsel.), then CR1 is updated. 


This instruction is optional and may not be implemented in every processor. Check the Users’ 
Manual of the implementation you are using. 
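
Because fsel chooses between two registers without a branch, it is often used for operations such as maximum or absolute value. The Python sketch below follows the operand naming used in the description above; the function names are hypothetical, and the treatment of a NaN in FRA (it takes the "otherwise" path) is an assumption based on the comparison failing for NaNs.

```python
def fsel_model(fra: float, frb: float, frc: float) -> float:
    """Return frb when fra >= 0, otherwise frc (a NaN in fra selects frc)."""
    return frb if fra >= 0.0 else frc

def fmax_via_fsel(a: float, b: float) -> float:
    """Branch-free maximum: test the sign of (a - b), then select a or b."""
    return fsel_model(a - b, a, b)
```

The subtraction in fmax_via_fsel can overflow or produce a NaN for extreme or infinite operands, so this idiom is only a sketch for ordinary finite values.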


fsqrt[.] 


floating-point square root double precision 


fsqrt[.] FRT, FRA

FRT ← √(FRA)


The fsqrt[.] instruction calculates the square root of the contents of floating-point register FRA 
and stores the result into floating-point register FRT. 


If the Rc bit is set (fsqrt.), then CR1 is updated. 
The following FPSCR fields may be updated by this instruction (see Appendix D): 
FPRF, FR, FI, FX, XX, VXSNAN, and VXSQRT. 


This instruction is optional and may not be implemented in every processor. Check the Users’ 
Manual of the implementation you are using.

fsqrts[.]
floating-point square root single precision

fsqrts[.] FRT, FRA

FRT ← √(FRA)

The fsqrts[.] instruction calculates the square root of the contents of floating-point register FRA.
This result is then rounded to single precision and stored into floating-point register FRT.

If the Rc bit is set (fsqrts.), then CR1 is updated.

The following FPSCR fields may be updated by this instruction (see Appendix D):
FPRF, FR, FI, FX, XX, VXSNAN, and VXSQRT.


This instruction is optional and may not be implemented in every processor. Check the Users’ 
Manual of the implementation you are using. 


fsub[.] 
floating-point subtract double precision 


fsub[.] FRT, FRA, FRB 


FRT ← (FRA) − (FRB)

The fsub[.] instruction performs a double precision subtraction of the contents of floating-point
register FRB from the contents of floating-point register FRA, and stores the result into
floating-point register FRT.

If the Rc bit is set (fsub.), then CR1 is updated.
The following FPSCR fields may be updated by this instruction (see Appendix D): 
FPRF, FR, FI, FX, OX, UX, XX, VXSNAN, and VXISI 


fsubs[.] 
floating-point subtract single precision 


fsubs[.] FRT, FRA, FRB 


FRT ← (FRA) − (FRB)

The fsubs[.] instruction performs a double precision subtraction of the contents of floating-point
register FRB from the contents of floating-point register FRA. This result is then rounded
to single precision and stored into floating-point register FRT.


If the Rc bit is set (fsubs.) then CR1 is updated. 

The following FPSCR fields may be updated by this instruction (see Appendix D): 
FPRF, FR, FI, FX, OX, UX, XX, VXSNAN, and VXISI 

icbi
instruction cache block invalidate

bits 0-5: 31 | 6-10: /// | 11-15: RA | 16-20: RB | 21-30: 982 | 31: /

icbi RA, RB

The icbi instruction invalidates a cache block. If RA is 0 (r0), then the contents of RB are used
as the effective address; otherwise, the sum of the contents of RA and RB is used as the
effective address. The cache block addressed by this effective address is invalidated from the
instruction cache.


The exact action taken depends on whether memory coherence is enforced for that address 
(see Virtual Memory), and on the state of that address in the cache. 

1. Coherence required

   unmodified block: Invalidate all copies of the block in the instruction caches of all
   processors.

   absent block: If the block is absent from the cache, then invalidate all copies of the
   block in the instruction caches of all other processors.

2. Coherence not required

   unmodified block: Invalidate all copies of the block in the instruction caches of this
   processor.

   absent block: Do nothing.
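
The four cases above amount to a small decision table. The Python sketch below restates it; the function name icbi_action and the wording of the returned strings are illustrative only.

```python
def icbi_action(coherence_required: bool, block_in_cache: bool) -> str:
    """Summarize the action taken for the addressed instruction cache block."""
    if coherence_required:
        if block_in_cache:
            return "invalidate block in instruction caches of all processors"
        return "invalidate block in instruction caches of all other processors"
    if block_in_cache:
        return "invalidate block in instruction cache of this processor"
    return "do nothing"
```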


inslwi[.]
insert from left word immediate

inslwi[.] RT, RA, n, b (n>0; (b+n)<32)

RT ← (RT)[0,b-1] || (RA)[0,n-1] || (RT)[b+n,31]

The inslwi[.] instruction extracts n bits from RA starting at bit position 0, and inserts them
into RT starting at bit position b (remaining bits in RT are unchanged).


If the Rc bit is set (inslwi.), then CR0 is updated.

This is an extended mnemonic for the rotate left word immediate with mask insert (rlwimi[.])
instruction:

rlwimi[.] RT, RA, 32-b, b, b+n-1
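
The mapping onto rlwimi can be checked with a small model. The following Python sketch assumes the usual PowerPC convention that bit 0 is the most significant bit of the word; all function names are illustrative, and the insrwi helper uses the mapping given in the insrwi entry later in this chapter.

```python
MASK32 = 0xFFFFFFFF

def rotl32(value: int, shift: int) -> int:
    """Rotate a 32-bit value left by shift bits."""
    shift %= 32
    return ((value << shift) | (value >> (32 - shift))) & MASK32

def mask32(mb: int, me: int) -> int:
    """Mask with bits mb..me set, where bit 0 is the most significant bit."""
    return (MASK32 >> mb) & (MASK32 << (31 - me)) & MASK32

def rlwimi(rt: int, ra: int, sh: int, mb: int, me: int) -> int:
    """Rotate RA left, then insert the masked bits into RT."""
    m = mask32(mb, me)
    return (rotl32(ra, sh) & m) | (rt & ~m & MASK32)

def inslwi(rt: int, ra: int, n: int, b: int) -> int:
    return rlwimi(rt, ra, 32 - b, b, b + n - 1)

def insrwi(rt: int, ra: int, n: int, b: int) -> int:
    return rlwimi(rt, ra, 32 - b - n, b, b + n - 1)
```

With RT = 0xFFFFFFFF and RA = 0x12345678, inserting 8 bits at bit position 8 gives 0xFF12FFFF for inslwi (the leftmost byte of RA) and 0xFF78FFFF for insrwi (the rightmost byte of RA).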

insrdi[.]


insert from right doubleword immediate 


insrdi[.] RT, RA, n, b (n>0)

RT ← (RT)[0,b-1] || (RA)[64-n,63] || (RT)[b+n,63]

The insrdi[.] instruction extracts the n rightmost bits from RA and inserts them into RT
starting at bit b (remaining bits in RT are unchanged).


If the Rc bit is set (insrdi.), then CR0 is updated.

This is an extended mnemonic for the rotate left doubleword immediate then mask insert
(rldimi[.]) instruction:

rldimi[.] RT, RA, 64-b-n, b

This instruction is defined only for 64-bit implementations.


insrwi[.]
insert from right word immediate

bits 0-5: 20 | 6-10: RA | 11-15: RT | 16-20: 32-b-n | 21-25: b | 26-30: b+n-1 | 31: Rc

insrwi[.] RT, RA, n, b (n>0; (b+n)<32)

RT ← (RT)[0,b-1] || (RA)[32-n,31] || (RT)[b+n,31]


The insrwi[.] instruction extracts the n rightmost bits from RA, and inserts them into RT starting 
at bit position b (remaining bits in RT are unchanged). 


If the Rc bit is set (insrwi.), then CR0 is updated.

This is an extended mnemonic for the rotate left word immediate with mask insert (rlwimi[.])
instruction:

rlwimi[.] RT, RA, 32-b-n, b, b+n-1

isync
instruction synchronize

isync

The isync instruction waits for all previous instructions to complete and then discards all
prefetched instructions. This is a context synchronizing instruction (see Appendix C).


la 


load address 


bits 0-5: 14 | 6-10: RT | 11-15: RA | 16-31: D

la RT, D(RA)

RT ← (RA) + (¹⁶D₁₆ || D₁₆,₃₁)

The la instruction performs an add of RA to the sign extended 16-bit immediate field, placing
the result into RT.

If the la instruction is used with RA = 0, then 0 is used instead of the contents of RA. Thus,
regardless of the contents of r0, la r1,4(r0) would load the constant 4 into r1.

This is an extended mnemonic for the add immediate (addi) instruction:

addi RT, RA, D
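
The address arithmetic performed by la is the same calculation that the displacement-form loads in this chapter use for their effective address. The Python sketch below models it for a 32-bit implementation; the register file is represented by a hypothetical list named gpr, and the function names are illustrative.

```python
def sign_extend16(d: int) -> int:
    """Interpret the low 16 bits of d as a signed value."""
    d &= 0xFFFF
    return d - 0x10000 if d & 0x8000 else d

def la_model(gpr: list, rt: int, ra: int, d: int) -> None:
    """Model la RT, D(RA): register number 0 in the RA position means the value 0."""
    base = 0 if ra == 0 else gpr[ra]
    gpr[rt] = (base + sign_extend16(d)) & 0xFFFFFFFF
```

The special case for RA = 0 is what allows small constants to be loaded without first clearing a register.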


lbz
load byte and zero-extend immediate

bits 0-5: 34 | 6-10: RT | 11-15: RA | 16-31: D

lbz RT, D(RA)

RT ← ²⁴0 || MEM[(RA) + (¹⁶D₁₆ || D₁₆,₃₁), 1]


The lbz instruction loads the byte contained at the effective address of the instruction into the
low-order byte of RT. The effective address of the instruction is found by adding the contents
of RA to the sign extended 16-bit immediate field, unless RA is 0 (r0), in which case the
effective address is the sign extended immediate field. The remaining bytes in RT are set to 0.

lbzu
load byte and zero-extend immediate with update

lbzu RT, D(RA)

RT ← ²⁴0 || MEM[(RA) + (¹⁶D₁₆ || D₁₆,₃₁), 1]
RA ← (RA) + (¹⁶D₁₆ || D₁₆,₃₁)

The lbzu instruction loads the byte contained at the effective address of the instruction into
the low-order byte of RT. The effective address of the instruction is found by adding the
contents of RA to the sign extended 16-bit immediate field. The remaining bytes in RT are set to
0. RA=RT or RA=0 is invalid for this instruction.


In addition, the effective address of the instruction is loaded into RA. 
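
The update form is convenient for walking through an array, because the pointer register advances as a side effect of each load. The Python sketch below models this; lbzu_model and sum_bytes are illustrative names, the register file is a hypothetical list, and the choice of r3 and r4 is arbitrary.

```python
def lbzu_model(gpr: list, memory: bytes, rt: int, ra: int, d: int) -> None:
    """Model lbzu RT, D(RA) on a 32-bit implementation."""
    if ra == 0 or ra == rt:
        raise ValueError("invalid instruction form: RA=0 or RA=RT")
    d &= 0xFFFF
    offset = d - 0x10000 if d & 0x8000 else d   # sign extend the 16-bit field
    ea = (gpr[ra] + offset) & 0xFFFFFFFF
    gpr[rt] = memory[ea]                        # upper bytes of RT are zero
    gpr[ra] = ea                                # update: RA receives the address

def sum_bytes(memory: bytes, start: int, count: int) -> int:
    """Sum count bytes beginning at start, using one lbzu per element."""
    gpr = [0] * 32
    gpr[4] = (start - 1) & 0xFFFFFFFF           # point one byte before the data
    total = 0
    for _ in range(count):
        lbzu_model(gpr, memory, 3, 4, 1)        # load the next byte, advance r4
        total += gpr[3]
    return total
```

Starting the pointer one element before the data means that a single instruction both advances the pointer and fetches the element on every pass through the loop.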


lbzux
load byte and zero-extend with update

lbzux RT, RA, RB

RT ← ²⁴0 || MEM[(RA) + (RB), 1]
RA ← (RA) + (RB)

The lbzux instruction loads the byte contained at the effective address of the instruction into
the low-order byte of RT. The effective address of the instruction is found by adding the
contents of RA to the contents of RB. The remaining bytes in RT are set to 0. RA=RT or RA=0 is
invalid for this instruction.


In addition, the effective address of the instruction is loaded into RA. 


lbzx
load byte and zero-extend

lbzx RT, RA, RB

RT ← ²⁴0 || MEM[(RA) + (RB), 1]

The lbzx instruction loads the byte contained at the effective address of the instruction into the
low-order byte of RT. The effective address of the instruction is found by adding the contents 
of RA to the contents of RB, unless RA is 0 (r0), in which case the effective address is RB. The 
remaining bytes in RT are set to 0. 


ld
load doubleword immediate

bits 0-5: 58 | 6-10: RT | 11-15: RA | 16-29: DS | 30-31: 0

ld RT, DS(RA)

RT ← MEM[(RA) + (⁴⁸DS₁₆ || DS₁₆,₂₉ || ²0), 8]

The ld instruction loads the doubleword contained at the effective address of the instruction
into RT. The effective address of the instruction is found by adding the contents of RA to the 
sign extended 14-bit immediate field with two binary zeros concatenated to the right, unless 
RA is 0 (r0), in which case the effective address is just the sign extended immediate field with 
two binary zeros concatenated to the right. 


This instruction is available only on 64-bit implementations. 
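
Because the 14-bit field is extended with two zero bits, only displacements that are multiples of 4 can be expressed. The Python sketch below shows how a byte displacement maps to and from the field; encode_ds and decode_ds are hypothetical helper names, not part of any assembler.

```python
def encode_ds(displacement: int) -> int:
    """Return the 14-bit DS field for a byte displacement."""
    if displacement % 4 != 0:
        raise ValueError("displacement must be a multiple of 4")
    if not -32768 <= displacement <= 32764:
        raise ValueError("displacement does not fit in the DS field")
    return (displacement >> 2) & 0x3FFF

def decode_ds(ds: int) -> int:
    """Recover the byte displacement: append two zero bits, then sign extend."""
    value = (ds & 0x3FFF) << 2
    return value - 0x10000 if value & 0x8000 else value
```

A displacement of 8 is therefore stored as 2 in the instruction, and a displacement such as 6 cannot be encoded at all.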


ldarx
load doubleword and reserve

ldarx RT, RA, RB

RT ← MEM[(RA) + (RB), 8]

The ldarx instruction loads the doubleword contained at the effective address of the instruction
into RT. The effective address of the instruction is found by adding the contents of RA to
the contents of RB, unless RA is 0 (r0), in which case the effective address is RB. The effective
address must be a multiple of 8 (doubleword aligned).

In addition, a reservation is created in this processor for use by the store doubleword
conditional (stdcx.) instruction. An address calculated from the effective address of this instruction
is associated with the reservation (see Appendix C).

This instruction is available only on 64-bit implementations.
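
The reservation is what makes an atomic read-modify-write sequence possible: the conditional store succeeds only if the reservation is still held. The Python sketch below is a toy single-processor model of that idea. The class and method names are hypothetical, and details such as reservation granularity are deliberately ignored (see Appendix C for the actual rules).

```python
class ReservationModel:
    """Toy model of a load-and-reserve / store-conditional pair."""

    def __init__(self):
        self.memory = {}          # doubleword address -> value
        self.reservation = None   # reserved address, or None

    def ldarx(self, ea: int) -> int:
        if ea % 8 != 0:
            raise ValueError("effective address must be doubleword aligned")
        self.reservation = ea
        return self.memory.get(ea, 0)

    def stdcx(self, ea: int, value: int) -> bool:
        """Store only if the reservation is still held; report success."""
        ok = self.reservation == ea
        if ok:
            self.memory[ea] = value & 0xFFFFFFFFFFFFFFFF
        self.reservation = None   # the reservation is cleared either way
        return ok

    def other_store(self, ea: int, value: int) -> None:
        """A store from elsewhere: it cancels a reservation on that address."""
        self.memory[ea] = value
        if self.reservation == ea:
            self.reservation = None

def atomic_increment(model: ReservationModel, ea: int) -> int:
    """Retry until the conditional store succeeds."""
    while True:
        old = model.ldarx(ea)
        if model.stdcx(ea, old + 1):
            return old + 1
```

In real code the retry loop is a conditional branch back to the ldarx when the stdcx. reports failure in the condition register.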

ldu
load doubleword immediate with update

ldu RT, DS(RA)

RT ← MEM[(RA) + (⁴⁸DS₁₆ || DS₁₆,₂₉ || ²0), 8]
RA ← (RA) + (⁴⁸DS₁₆ || DS₁₆,₂₉ || ²0)

The ldu instruction loads the doubleword contained at the effective address of the instruction
into RT. The effective address of the instruction is found by adding the contents of RA to the
sign extended 14-bit immediate field with two binary zeros concatenated to the right. RA=RT
or RA=0 is invalid for this instruction.


In addition, the effective address of the instruction is loaded into RA. 


This instruction is available only on 64-bit implementations. 


ldux
load doubleword with update

ldux RT, RA, RB

RT ← MEM[(RA) + (RB), 8]
RA ← (RA) + (RB)

The ldux instruction loads the doubleword contained at the effective address of the instruction
into RT. The effective address of the instruction is found by adding the contents of RA to the
contents of RB. RA=RT or RA=0 is invalid for this instruction.


In addition, the effective address of the instruction is loaded into RA. 


This instruction is available only on 64-bit implementations. 


ldx
load doubleword

bits 0-5: 31 | 6-10: RT | 11-15: RA | 16-20: RB | 21-30: 21 | 31: /

ldx RT, RA, RB

RT ← MEM[(RA) + (RB), 8]

The ldx instruction loads the doubleword contained at the effective address of the instruction
into RT. The effective address of the instruction is found by adding the contents of RA to the 
contents of RB, unless RA is 0 (r0), in which case the effective address is just the contents 
of RB. 


This instruction is available only on 64-bit implementations. 


lfd 


load double precision floating-point immediate 


bits 0-5: 50 | 6-10: FRT | 11-15: RA | 16-31: D

lfd FRT, D(RA)

FRT ← MEM[(RA) + (¹⁶D₁₆ || D₁₆,₃₁), 8]

The lfd instruction loads the doubleword contained at the effective address of the instruction
into floating-point register FRT. The effective address of the instruction is found by adding 
the contents of RA to the sign extended 16-bit immediate field, unless RA is 0 (r0), in which 
case the effective address is just the sign extended immediate field. 


lfdu 


load double precision floating-point immediate with update 


bits 0-5: 51 | 6-10: FRT | 11-15: RA | 16-31: D

lfdu FRT, D(RA)

FRT ← MEM[(RA) + (¹⁶D₁₆ || D₁₆,₃₁), 8]
RA ← (RA) + (¹⁶D₁₆ || D₁₆,₃₁)

The lfdu instruction loads the doubleword contained at the effective address of the instruction
into floating-point register FRT. The effective address of the instruction is found by adding
the contents of RA to the sign extended 16-bit immediate field. If RA=0 the instruction form
is invalid.


In addition, the effective address is loaded into register RA.

lfdux
load double precision floating-point with update

bits 0-5: 31 | 6-10: FRT | 11-15: RA | 16-20: RB | 21-30: 631 | 31: /

lfdux FRT, RA, RB

FRT ← MEM[(RA) + (RB), 8]
RA ← (RA) + (RB)

The lfdux instruction loads the doubleword contained at the effective address of the instruction
into floating-point register FRT. The effective address of the instruction is found by adding
the contents of RA to the contents of RB. If RA=0 the instruction form is invalid.


In addition, the effective address is loaded into register RA. 


lfdx 


load double precision floating-point 


bits 0-5: 31 | 6-10: FRT | 11-15: RA | 16-20: RB | 21-30: 599 | 31: /

lfdx FRT, RA, RB

FRT ← MEM[(RA) + (RB), 8]

The lfdx instruction loads the doubleword contained at the effective address of the instruction
into floating-point register FRT. The effective address of the instruction is found by adding
the contents of RA to the contents of RB, unless RA is 0 (r0), in which case the effective
address is just the contents of RB.


lfs
load single precision floating-point immediate

lfs FRT, D(RA)

FRT ← MEM[(RA) + (¹⁶D₁₆ || D₁₆,₃₁), 4]


The lfs instruction loads the word contained at the effective address of the instruction into
floating-point register FRT. The effective address of the instruction is found by adding the
contents of RA to the sign extended 16-bit immediate field, unless RA is 0 (r0), in which case the
effective address is just the sign extended immediate field. The word is treated as a single
precision floating-point number and extended to double precision before being loaded into FRT.
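
The widening from single to double precision is exact: every single precision value has a double precision equivalent. The Python sketch below reads a word from a byte string and widens it; lfs_model is a hypothetical name, and big-endian byte order is assumed for the memory image.

```python
import struct

def lfs_model(memory: bytes, ea: int) -> float:
    """Load the big-endian single precision word at ea and widen it to double."""
    (value,) = struct.unpack_from(">f", memory, ea)
    return value
```

For example, the word 0x3FC00000 is the single precision encoding of 1.5, and it loads as the double precision value 1.5.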



lfsu
load single precision floating-point immediate with update

bits 0-5: 49 | 6-10: FRT | 11-15: RA | 16-31: D

lfsu FRT, D(RA)

FRT ← MEM[(RA) + (¹⁶D₁₆ || D₁₆,₃₁), 4]
RA ← (RA) + (¹⁶D₁₆ || D₁₆,₃₁)


The lfsu instruction loads the word contained at the effective address of the instruction into
floating-point register FRT. The effective address of the instruction is found by adding the 
contents of RA to the sign extended 16-bit immediate field. The word is treated as a single 
precision floating-point number and extended to double precision before being loaded into 
FRT. If RA=0 the instruction form is invalid. 


In addition, the effective address is loaded into register RA. 


lfsux 


load single precision floating-point with update 


bits 0-5: 31 | 6-10: FRT | 11-15: RA | 16-20: RB | 21-30: 567 | 31: /

lfsux FRT, RA, RB

FRT ← MEM[(RA) + (RB), 4]
RA ← (RA) + (RB)

The lfsux instruction loads the word contained at the effective address of the instruction into
floating-point register FRT. The effective address of the instruction is found by adding the
contents of RA to the contents of RB. The word is treated as a single precision floating-point
number and extended to double precision before being loaded into FRT. If RA=0 the
instruction form is invalid.

In addition, the effective address is loaded into register RA.


lfsx
load single precision floating-point

bits 0-5: 31 | 6-10: FRT | 11-15: RA | 16-20: RB | 21-30: 535 | 31: /

lfsx FRT, RA, RB

FRT ← MEM[(RA) + (RB), 4]

The lfsx instruction loads the word contained at the effective address of the instruction into
floating-point register FRT. The effective address of the instruction is found by adding the 
contents of RA to the contents of RB, unless RA is 0 (r0), in which case the effective address is 
just the contents of RB. The word is treated as a single precision floating-point number and 
extended to double precision before being loaded into FRT. 


lha 
load halfword algebraic immediate 


lha RT, D(RA)

RT ← ¹⁶(MEM[(RA) + (¹⁶D₁₆ || D₁₆,₃₁), 2])₀ || MEM[(RA) + (¹⁶D₁₆ || D₁₆,₃₁), 2]


The lha instruction loads the half-word contained at the effective address of the instruction 
into the low-order 2 bytes of RT. This halfword is sign-extended into the upper-order 2 bytes 
of RT. The effective address of the instruction is found by adding the contents of RA to the 
sign extended 16-bit immediate field, unless RA is 0 (r0), in which case the effective address is 
the sign extended immediate field. 
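
The difference between an algebraic load and a zero-extending load shows up only when the high-order bit of the halfword is set. The Python sketch below models the sign extension for a 32-bit register; lha_model is a hypothetical name, and big-endian byte order is assumed for the memory image.

```python
def lha_model(memory: bytes, ea: int) -> int:
    """Load the big-endian halfword at ea and sign extend it to 32 bits."""
    half = int.from_bytes(memory[ea:ea + 2], "big")
    if half & 0x8000:          # high-order bit set: the value is negative
        half |= 0xFFFF0000     # copy the sign into the upper two bytes
    return half
```

A stored halfword of 0xFFFE (the value -2) therefore loads as 0xFFFFFFFE, while 0x1234 loads unchanged as 0x00001234.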


lhau 
load halfword algebraic immediate with update 


bits 0-5: 43 | 6-10: RT | 11-15: RA | 16-31: D

lhau RT, D(RA)

RT ← ¹⁶(MEM[(RA) + (¹⁶D₁₆ || D₁₆,₃₁), 2])₀ || MEM[(RA) + (¹⁶D₁₆ || D₁₆,₃₁), 2]
RA ← (RA) + (¹⁶D₁₆ || D₁₆,₃₁)


The lhau instruction loads the half-word contained at the effective address of the instruction 
into the low-order 2 bytes of RT. This halfword is sign-extended into the upper-order 2 bytes 
of RT. The effective address of the instruction is found by adding the contents of RA to the 
sign extended 16-bit immediate field. If RA=0 or RA=RT the instruction form is invalid. 


In addition, the effective address is loaded into register RA. 


lhaux
load halfword algebraic with update

bits 0-5: 31 | 6-10: RT | 11-15: RA | 16-20: RB | 21-30: 375 | 31: /

lhaux RT, RA, RB

RT ← ¹⁶(MEM[(RA) + (RB), 2])₀ || MEM[(RA) + (RB), 2]
RA ← (RA) + (RB)


The lhaux instruction loads the half-word contained at the effective address of the instruction 
into the low-order 2 bytes of RT. This halfword is sign-extended into the upper-order 2 bytes 
of RT. The effective address of the instruction is found by adding the contents of RA to the 
contents of RB. If RA=0 or RA=RT the instruction form is invalid. 


In addition, the effective address is loaded into register RA. 


lhax
load halfword algebraic

lhax RT, RA, RB

RT ← ¹⁶(MEM[(RA) + (RB), 2])₀ || MEM[(RA) + (RB), 2]

The lhax instruction loads the half-word contained at the effective address of the instruction
into the low-order 2 bytes of RT. This halfword is sign-extended into the upper-order 2 bytes
of RT. The effective address of the instruction is found by adding the contents of RA to the
contents of RB, unless RA is 0 (r0), in which case the effective address is just the contents
of RB.


lhbrx 
load halfword and reverse bytes 


0 31 5 | 6 RT 10 | 11 RA 15 | 16 RB 20 | 21 790 30 | 31 /


lhbrx RT, RA, RB


RT ← ¹⁶0 || (MEM[((RA) + (RB) + 1), 1]) || (MEM[((RA) + (RB)), 1])


The lhbrx instruction loads the half-word contained at the effective address of the instruction 
into the low-order 2 bytes of RT. The effective address of the instruction is found by adding 
the contents of RA to the contents of RB, unless RA is 0 (r0), in which case the effective
address is RB. The remaining bytes in RT are set to 0. The two bytes of the loaded halfword are
reversed (swapped) before being placed into RT. 
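

As a small illustration of the byte swap, the following Python sketch models only the data path of lhbrx. It assumes memory is a bytes object and that the effective address has already been computed; the function name and arguments are illustrative.

```python
def lhbrx_value(mem, ea):
    """Return the value lhbrx would place in RT for effective address ea."""
    low = mem[ea]        # byte at EA becomes the low-order byte
    high = mem[ea + 1]   # byte at EA+1 becomes the next byte up
    return (high << 8) | low   # remaining high-order bytes are 0
```

For memory containing the bytes 0x12, 0x34, an ordinary halfword load yields 0x1234, while this model yields 0x3412.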


lhz
load half-word and zero-extend immediate 




lhz RT, D(RA)


RT ← ¹⁶0 || (MEM[((RA) + (¹⁶D₁₆ || D₁₆₋₃₁)), 2])


The lhz instruction loads the half-word contained at the effective address of the instruction 
into the low-order 2 bytes of RT. The effective address of the instruction is found by adding 
the contents of RA to the sign extended 16-bit immediate field, unless RA is 0 (r0), in which 
case the effective address is the sign extended immediate field. The remaining bytes in RT are 
set to 0. 


lhzu 
load half-word and zero-extend immediate with update 


lhzu RT, D(RA)


RT ← ¹⁶0 || (MEM[((RA) + (¹⁶D₁₆ || D₁₆₋₃₁)), 2])


RA ← (RA) + (¹⁶D₁₆ || D₁₆₋₃₁)


The lhzu instruction loads the half-word contained at the effective address of the instruction 
into the low-order 2 bytes of RT. The effective address of the instruction is found by adding 
the contents of RA to the sign extended 16-bit immediate field. If RA=0 or RA=RT the
instruction form is invalid. The remaining bytes in RT are set to 0.
In addition, the effective address of the instruction is loaded into RA. 


lhzux 


load half-word and zero-extend with update 


0 31 5 | 6 RT 10 | 11 RA 15 | 16 RB 20 | 21 311 30 | 31 /


lhzux RT, RA, RB


RT ← ¹⁶0 || (MEM[((RA) + (RB)), 2])


RA ← (RA) + (RB)


The lhzux instruction loads the half-word contained at the effective address of the instruction
into the low-order 2 bytes of RT. The effective address of the instruction is found by adding 
the contents of RA to the contents of RB. If RA=0 or RA=RT the instruction form is invalid. 
The remaining bytes in RT are set to 0. 


In addition, the effective address of the instruction is loaded into RA. 


lhzx 
load half-word and zero-extend 


0 31 5 | 6 RT 10 | 11 RA 15 | 16 RB 20 | 21 279 30 | 31 /


lhzx RT, RA, RB


RT ← ¹⁶0 || (MEM[((RA) + (RB)), 2])


The lhzx instruction loads the half-word contained at the effective address of the instruction 
into the low-order 2 bytes of RT. The effective address of the instruction is found by adding 
the contents of RA to the contents of RB, unless RA is 0 (r0), in which case the effective
address is RB. The remaining bytes in RT are set to 0.


li 


load immediate 


0 14 5 | 6 RT 10 | 11 0 15 | 16 D 31


li RT, D


RT ← (¹⁶D₁₆ || D₁₆₋₃₁)


The li instruction loads the sign extended 16-bit immediate field into RT. This is an extended 
mnemonic for the add immediate (addi) instruction: 


addi RT, r0, D 


lis 
load immediate shifted 


lis RT, D


RT ← (D₁₆₋₃₁ || ¹⁶0)




The lis instruction loads the 16-bit immediate field into the upper-order bits of RT, zeroing 
the low-order bits of RT. This is an extended mnemonic for the add immediate shifted (addis) 
instruction: 


addis RT, r0, D 
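

A common use of these two mnemonics is building a full 32-bit constant: lis supplies the upper halfword and a following logical OR immediate (ori, described later in this appendix) supplies the lower halfword. The Python sketch below models the values involved on a 32-bit implementation; the helper names li, lis, and split_constant are illustrative assumptions, not assembler features.

```python
MASK32 = 0xFFFFFFFF

def li(d):
    """Value loaded by li RT, D: the sign-extended 16-bit immediate."""
    d &= 0xFFFF
    return (d - 0x10000 if d & 0x8000 else d) & MASK32

def lis(d):
    """Value loaded by lis RT, D: the immediate in the upper 16 bits."""
    return ((d & 0xFFFF) << 16) & MASK32

def split_constant(value):
    """Split a 32-bit constant into (upper, lower) 16-bit immediates."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF
```

For example, split_constant(0x1234ABCD) gives the immediates 0x1234 and 0xABCD; ORing 0xABCD into the result of lis(0x1234) reproduces the constant. Note that li alone cannot produce 0x0000ABCD, because the immediate 0xABCD is sign-extended.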


lmw
integer load multiple word


lmw RT, D(RA)


RT, ..., R₃₁ ← (MEM[((RA) + (¹⁶D₁₆ || D₁₆₋₃₁)), (4 × (32 − RT))])


Registers RT through R31 are loaded with consecutive words read from memory starting at the
effective address of the instruction. The effective address of the instruction is found by adding
the contents of RA to the sign extended 16-bit immediate field, unless RA is 0 (r0), in which
case the effective address is the sign extended immediate field. RA must be less than RT; the
form is invalid if RA is in the range of registers to be loaded. The effective address must be a
multiple of 4 (word aligned).


On 64-bit implementations, only the low-order 32 bits of each register are loaded; the high- 
order 32 bits are set to 0. 
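

The register range and the restrictions described above can be sketched in Python as follows. The model assumes a 32-bit implementation, big-endian memory in a bytes object, and a 32-entry list as the register file; the names lmw, regs, and mem are illustrative.

```python
MASK32 = 0xFFFFFFFF

def lmw(regs, mem, rt, ra, d):
    """Model lmw RT, D(RA): load registers RT through r31 from memory."""
    if ra >= rt:
        raise ValueError("invalid form: RA is in the range being loaded")
    d &= 0xFFFF
    disp = d - 0x10000 if d & 0x8000 else d      # sign-extend D
    base = 0 if ra == 0 else regs[ra]            # RA=0 means a base of 0
    ea = (base + disp) & MASK32
    if ea % 4 != 0:
        raise ValueError("effective address must be word aligned")
    for i, r in enumerate(range(rt, 32)):
        word = mem[ea + 4 * i: ea + 4 * i + 4]
        regs[r] = int.from_bytes(word, "big")
```

With RT=30, exactly two words are read: one into r30 and the next into r31.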


lswi
load string word immediate


0 31 5 | 6 RT 10 | 11 RA 15 | 16 NB 20 | 21 597 30 | 31 /


lswi RT, RA, n

NB = (n = 0) ? 32 : n

RT, ..., R(T + ⌈NB/4⌉ − 1) ← (MEM[(RA), NB])


Starting with RT, registers are loaded (only the low-order 4 bytes of 64-bit registers) with n 
consecutive bytes read from memory starting at the effective address of the instruction. The 
effective address of the instruction is the contents of RA, unless RA is 0 (r0), in which case the 
effective address is 0. The instruction is invalid if RA is in the range of registers to be loaded. If
n=0, 32 bytes are loaded. The high-order 4 bytes of 64-bit registers are zeroed. Unfilled low-order
bytes of the last register are set to 0.
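

The way a byte string is packed into successive registers can be sketched in Python. The model below assumes 32-bit registers, big-endian memory in a bytes object, and that the register sequence wraps from r31 to r0; it does not check whether RA lies in the loaded range. All names are illustrative.

```python
def lswi(regs, mem, rt, ra, n):
    """Model lswi RT, RA, n: load n bytes (32 if n is 0) into RT, RT+1, ..."""
    nbytes = 32 if n == 0 else n
    ea = 0 if ra == 0 else regs[ra]
    r = rt
    for offset in range(0, nbytes, 4):
        chunk = mem[ea + offset: ea + min(offset + 4, nbytes)]
        # Bytes fill a register from the high-order end; a short final
        # chunk is padded on the right, zeroing the low-order bytes.
        regs[r] = int.from_bytes(chunk.ljust(4, b"\x00"), "big")
        r = (r + 1) % 32
```

Loading the six bytes "ABCDEF" starting at r5 leaves 0x41424344 in r5 and 0x45460000 in r6.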


lswx
load string word indexed


0 31 5 | 6 RT 10 | 11 RA 15 | 16 RB 20 | 21 533 30 | 31 /


lswx RT, RA, RB


RT, ..., R(T + ⌈XER₂₅₋₃₁/4⌉ − 1) ← (MEM[((RA) + (RB)), XER₂₅₋₃₁])


Starting with RT, registers are loaded (only the low-order 4 bytes of 64-bit registers) with
consecutive bytes read from memory starting at the effective address of the instruction. The
number of bytes to be loaded is stored in XER25:31. The effective address of the instruction is
found by adding the contents of RA to the contents of RB, unless RA is 0 (r0), in which case the
effective address is the contents of RB. The instruction is invalid if RA is in the range of
registers to be loaded. Unfilled low-order bytes of the last register are set to 0.


lwa 
load word algebraic immediate 


0 58 5 | 6 RT 10 | 11 RA 15 | 16 DS 29 | 30 2 31


lwa RT, D(RA)


RT ← ³²(MEM[((RA) + (⁴⁸DS₁₆ || DS₁₆₋₂₉ || ²0)), 4])₀ || (MEM[((RA) + (⁴⁸DS₁₆ || DS₁₆₋₂₉ || ²0)), 4])


The lwa instruction loads the word contained at the effective address of the instruction into 
the low-order 4 bytes of RT. This word is sign-extended into the upper-order 4 bytes of RT. 
The effective address of the instruction is found by adding the contents of RA to the sign
extended 14-bit immediate field, concatenated on the right with two binary zeros, unless RA is 0
(r0), in which case the effective address is the sign extended immediate field, concatenated on 
the right with two binary zeros. 


This instruction is defined only for 64-bit implementations. 


lwarx 
load word and reserve 


lwarx RT, RA, RB


RT ← (MEM[((RA) + (RB)), 4])


The lwarx instruction loads the word contained at the effective address of the instruction into
RT. The effective address of the instruction is found by adding the contents of RA to the
contents of RB, unless RA is 0 (r0), in which case the effective address is RB. The remaining bytes
in RT are set to 0. The effective address must be a multiple of 4 (word aligned).


In addition, a reservation is created in this processor for use by the store word conditional (stwcx.)
instruction. An address calculated from the effective address of this instruction is associated 
with the reservation (see Appendix C). 


lwaux 


load word algebraic with update 


0 31 5 | 6 RT 10 | 11 RA 15 | 16 RB 20 | 21 373 30 | 31 /


lwaux RT, RA, RB
RT ← ³²(MEM[((RA) + (RB)), 4])₀ || (MEM[((RA) + (RB)), 4])


RA ← (RA) + (RB)


The lwaux instruction loads the word contained at the effective address of the instruction into 
the low-order 4 bytes of RT. This word is sign-extended into the upper-order 4 bytes of RT. 
The effective address of the instruction is found by adding the contents of RA to the contents 
of RB. If RA=0 or RA=RT the instruction form is invalid.


In addition, the effective address is loaded into register RA. 


This instruction is defined only for 64-bit implementations. 


lwax 

load word algebraic 
lwax RT, RA, RB


RT ← ³²(MEM[((RA) + (RB)), 4])₀ || (MEM[((RA) + (RB)), 4])


The lwax instruction loads the word contained at the effective address of the instruction into 
the low-order 4 bytes of RT. This word is sign-extended into the upper-order 4 bytes of RT. 
The effective address of the instruction is found by adding the contents of RA to the contents 
of RB, unless RA is 0 (r0), in which case the effective address is just the contents of RB. 


This instruction is defined only for 64-bit implementations. 


lwbrx 


load word and reverse bytes (indexed addressing form)
0 31 5 | 6 RT 10 | 11 RA 15 | 16 RB 20 | 21 534 30 | 31 /


lwbrx RT, RA, RB


RT ← (MEM[(((RA) + (RB)) + 3), 1]) || (MEM[(((RA) + (RB)) + 2), 1]) ||
(MEM[(((RA) + (RB)) + 1), 1]) || (MEM[((RA) + (RB)), 1])


The lwbrx instruction loads the word contained at the effective address of the instruction into 
RT. The effective address of the instruction is found by adding the contents of RA to the
contents of RB, unless RA is 0 (r0), in which case the effective address is RB. The remaining bytes
in RT are set to 0. The order of the four bytes of the loaded word is reversed before being
placed into RT.


lwz 
load word and zero-extend immediate 


0 32 5 | 6 RT 10 | 11 RA 15 | 16 D 31


lwz RT, D(RA)


RT ← (MEM[((RA) + (¹⁶D₁₆ || D₁₆₋₃₁)), 4])


The lwz instruction loads the word contained at the effective address of the instruction into 
RT. The effective address of the instruction is found by adding the contents of RA to the sign 
extended 16-bit immediate field, unless RA is 0 (r0), in which case the effective address is the 
sign extended immediate field. 


lwzu 
load word and zero-extend immediate with update 


0 33 5 | 6 RT 10 | 11 RA 15 | 16 D 31


lwzu RT, D(RA)


RT ← (MEM[((RA) + (¹⁶D₁₆ || D₁₆₋₃₁)), 4])


RA ← (RA) + (¹⁶D₁₆ || D₁₆₋₃₁)


The lwzu instruction loads the word contained at the effective address of the instruction into
RT. The effective address of the instruction is found by adding the contents of RA to the sign
extended 16-bit immediate field. If RA=0 or RA=RT the instruction form is invalid.


In addition, the effective address of the instruction is loaded into RA.


lwzux
load word and zero-extend with update 


lwzux RT, RA, RB


RT ← (MEM[((RA) + (RB)), 4])
RA ← (RA) + (RB)


The lwzux instruction loads the word contained at the effective address of the instruction into
RT. The effective address of the instruction is found by adding the contents of RA to the
contents of RB. The remaining bytes in RT are set to 0. If RA=0 or RA=RT the instruction
form is invalid.


In addition, the effective address of the instruction is loaded into RA. 


lwzx 
load word and zero-extend 


0 31 5 | 6 RT 10 | 11 RA 15 | 16 RB 20 | 21 23 30 | 31 /


lwzx RT, RA, RB


RT ← (MEM[((RA) + (RB)), 4])


The lwzx instruction loads the word contained at the effective address of the instruction into 
RT. The effective address of the instruction is found by adding the contents of RA to the
contents of RB, unless RA is 0 (r0), in which case the effective address is RB. The remaining bytes
in RT are set to 0. 


mcrf


move condition register field

mcrf BF, BFA


CR_BF ← (CR_BFA)


The mcrf instruction copies condition register field BFA (CR_BFA) into condition register field
BF (CR_BF).





mcrfs
move FPSCR to condition register 


mcrfs BT, BA


CR_BT ← (FPSCR_BA)


The mcrfs instruction copies FPSCR field BA (FPSCR_BA) into condition register field BT (CR_BT).
The following FPSCR fields may be updated by this instruction (see Appendix D): 


FX, OX, UX, ZX, XX, VXSNAN, VXISI, VXIDI, VXZDZ, VXIMZ, VXVC, VXSOFT, 
VXSQRT, and VXCVI. 


mcrxr 
move XER to condition register 


mcrxr BF


CR_BF ← XER₀₋₃
XER₀₋₃ ← 0b0000


The mcrxr instruction copies XER bits 0-3 into condition register field BF (CR_BF). In
addition, XER bits 0-3 are set to 0.
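

Because bit 0 is the most significant bit in PowerPC numbering, XER bits 0-3 are the top four bits of the register, and CR field BF occupies bits 4×BF through 4×BF+3. A minimal Python sketch of the move, assuming 32-bit CR and XER values held as integers (the function name is illustrative):

```python
MASK32 = 0xFFFFFFFF

def mcrxr(cr, xer, bf):
    """Model mcrxr BF: return the new (CR, XER) pair."""
    field = (xer >> 28) & 0xF            # XER bits 0-3 (most significant)
    shift = 4 * (7 - bf)                 # position of CR field BF
    cr = (cr & ~(0xF << shift) & MASK32) | (field << shift)
    xer &= 0x0FFFFFFF                    # XER bits 0-3 are cleared
    return cr, xer
```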


mfcr 
move from condition register 


0 31 5 | 6 RT 10 | 11 /// 20 | 21 19 30 | 31 /


mfcr RT 


RT <— (CR) 


The mfcr instruction copies the contents of the condition register into the register RT. 


mfctr
move from count register (CTR) 


0 31 5 | 6 RT 10 | 11 288 20 | 21 339 30 | 31 /


mfctr RT


RT ← (CTR)


The mfctr instruction copies the contents of the count register into the register RT. This is an 
extended mnemonic of the move from SPR (mfspr) instruction: 


mfspr RT, 9 

mffs[.] 

move from FPSCR 
mffs[.] FRT


FRT₃₂₋₆₃ ← (FPSCR)


The mffs[.] instruction copies the contents of the FPSCR into the low-order 32 bits of
floating-point register FRT.


If the Rc bit is set (mffs.), then CR1 is updated.


mflr 
move from link register (LR) 


0 31 5 | 6 RT 10 | 11 256 20 | 21 339 30 | 31 /


mflr RT


RT ← (LR)


The mflr instruction copies the contents of the link register into the register RT. This is an 
extended mnemonic of the move from SPR (mfspr) instruction: 


mfspr RT, 8 


mfmsr
move from machine state register


mfmsr RT


RT ← (MSR)


The mfmsr instruction copies the contents of the MSR into register RT.


This instruction is privileged (see Appendix C). 


mfspr 
move from special purpose register 


0 31 5 | 6 RT 10 | 11 SPR 20 | 21 339 30 | 31 /


mfspr RT, SPR


RT ← (SPR)


The mfspr instruction copies the contents of the special purpose register identified by the SPR 
field into the register RT (see Table B.5). Note that the decimal values used in the instruction 
mnemonic do not correspond exactly to the SPR field values. This is because the SPR field is 
split with the upper-order 5 bits to the right of the lower-order five bits. 
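

The relationship between the decimal SPR number used in the mnemonic and the value stored in the instruction's SPR field can be checked with a short Python sketch. The function name spr_field is illustrative; the operation is just a swap of the two 5-bit halves of the 10-bit number.

```python
def spr_field(number):
    """Encode a decimal SPR number as the 10-bit SPR field value."""
    low5 = number & 0x1F           # low-order 5 bits of the SPR number
    high5 = (number >> 5) & 0x1F   # high-order 5 bits of the SPR number
    return (low5 << 5) | high5     # the halves trade places in the field
```

For example, spr_field(9) is 0x120 (the count register) and spr_field(8) is 0x100 (the link register). Applying the function twice returns the original number.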


Table B.5. SPR encodings.


                                                              Encoding
Special Purpose Register                                  Decimal  SPR field  Privileged  Read/Write
Integer Exception Register (XER)                          1        0x020      no          R/W
Link Register (LR)                                        8        0x100      no          R/W
Count Register (CTR)                                      9        0x120      no          R/W
Data Storage Interrupt Status Register (DSISR)            18       0x240      yes         R/W
Data Address Register (DAR)                               19       0x260      yes         R/W
Decrementor (DEC)                                         22       0x2C0      yes         R/W
Storage Descriptor Register 1 (SDR1)                      25       0x320      yes         R/W
Machine Status Save/Restore Register 0 (SRR0)             26       0x340      yes         R/W
Machine Status Save/Restore Register 1 (SRR1)             27       0x360      yes
Software Use SPR 0 (SPRG0)                                272      0x208      yes
Software Use SPR 1 (SPRG1)                                273      0x228      yes
Software Use SPR 2 (SPRG2)                                274      0x248      yes
Software Use SPR 3 (SPRG3)                                275      0x268      yes
Address Space Register (ASR)¹                             280      0x308      yes
External Access Register (EAR)                            282      0x348      yes
Time Base Lower (TBL)                                     284      0x388      yes
Time Base Upper (TBU)                                     285      0x3A8      yes
Processor Version Register (PVR)                          287      0x3E8      yes
Instruction Block Address Translation Upper Register 0 (IBAT0U)   528   0x210   yes
Instruction Block Address Translation Lower Register 0 (IBAT0L)   529   0x230   yes
Instruction Block Address Translation Upper Register 1 (IBAT1U)   530   0x250   yes
Instruction Block Address Translation Lower Register 1 (IBAT1L)   531   0x270   yes
Instruction Block Address Translation Upper Register 2 (IBAT2U)   532   0x290   yes
Instruction Block Address Translation Lower Register 2 (IBAT2L)   533   0x2B0   yes
Instruction Block Address Translation Upper Register 3 (IBAT3U)   534   0x2D0   yes
Instruction Block Address Translation Lower Register 3 (IBAT3L)   535   0x2F0   yes   R/W
Data Block Address Translation Upper Register 0 (DBAT0U)          536   0x310   yes   R/W
Data Block Address Translation Lower Register 0 (DBAT0L)          537   0x330   yes   R/W
Data Block Address Translation Upper Register 1 (DBAT1U)          538   0x350   yes   R/W
Data Block Address Translation Lower Register 1 (DBAT1L)          539   0x370   yes   R/W
Data Block Address Translation Upper Register 2 (DBAT2U)          540   0x390   yes   R/W
Data Block Address Translation Lower Register 2 (DBAT2L)          541   0x3B0   yes   R/W
Data Block Address Translation Upper Register 3 (DBAT3U)          542   0x3D0   yes   R/W
Data Block Address Translation Lower Register 3 (DBAT3L)          543   0x3F0   yes   R/W
Data Address Breakpoint Register (DABR)                           1013  0x2BF   yes   R/W


¹ 64-bit implementations only.


mfsr
move from segment register 


mfsr RT, Sy


RT ← (Sy)


The mfsr instruction copies the contents of segment register Sy into register RT. 


This instruction is privileged (see Appendix C). This instruction is defined only for 32-bit 
implementations. 


mfsrin 
move from segment register indirect 


0 31 5 | 6 RT 10 | 11 / 15 | 16 RA 20 | 21 659 30 | 31 /


mfsrin RT, RA


RT ← (S[(RA)₀₋₃])


The mfsrin instruction copies the contents of the segment register pointed to by the high- 
order 4 bits of register RA into register RT. 


This instruction is privileged (see Appendix C). This instruction is defined only for 32-bit 
implementations. 


mftb 


move from time base register 
mftb RT, tbr


RT ← (TB_tbr)


The mftb instruction copies the contents of either the time base or the time base upper register
into the register RT (see Table B.6). Notice that the decimal encoding used in the mnemonic
is different from the encoding used in the tbr field; the 5-bit halves of the tbr field are reversed.


Table B.6. TBR encodings for mftb instruction.


                                Encoding
Time Base Register       Decimal  TBR field  Privileged
Time Base (TB)           268      0x188      no
Time Base Upper (TBU)    269      0x1A8      no


mftb
move from time base register


0 31 5 | 6 RT 10 | 11 392 20 | 21 371 30 | 31 /


mftb RT


RT ← (TB)


The mftb instruction copies the contents of the time base (TB) register into the register RT. 
On 64-bit implementations, this instruction loads the entire 64-bit time base register into RT, 
while on 32-bit implementations only the low-order 32-bits of the time base register are loaded 
into RT. This is an extended mnemonic of the move from time base (mftb) instruction (the 
difference being the number of operands): 


mftb RT, 268 


mftbu
move from time base register upper


0 31 5 | 6 RT 10 | 11 424 20 | 21 371 30 | 31 /


mftbu RT


RT ← (TBU)


The mftbu instruction copies the contents of the time base upper (TBU) register into the
register RT. On 64-bit implementations, this instruction loads the upper-order 32 bits of the time
base register into the low-order 32 bits of register RT. This is an extended mnemonic of the
move from time base (mftb) instruction:


mftb RT, 269


mfxer
move from integer exception register (XER)


0 31 5 | 6 RT 10 | 11 32 20 | 21 339 30 | 31 /


mfxer RT


RT ← (XER)


The mfxer instruction copies the contents of the fixed point exception register into the register 
RT. This is an extended mnemonic of the move from SPR (mfspr) instruction: 


mfspr RT, 1 


mr[.]
move register


mr[.] RT, RA


RT ← (RA)


The mr[.] instruction copies the contents of register RA into register RT. This is an extended 
mnemonic for the logical OR (or[.]) instruction:


or[.] RT, RA, RA


If the Rc bit is set (mr.), then CR0 is updated.


mtcr 
move to condition register 


mtcr RA


CR ← (RA)


The mtcr instruction copies the contents of register RA into the condition register. This is an
extended mnemonic for the move to condition register fields (mtcrf) instruction: 


mtcrf OxFF, RA 


mtcrf
move to condition register fields 


mtcrf FXM, RA
MASK ← ⁴FXM₀ || ⁴FXM₁ || ⁴FXM₂ || ... || ⁴FXM₇


CR ← ((RA) ∧ MASK) ∨ ((CR) ∧ ¬MASK)


The mtcrf instruction copies the contents of register RA into the condition register under the 
control of the immediate field FXM. If there is a one in a bit position in the FXM field, then 
the corresponding field in the condition register is updated from RA. Otherwise, the field is 
left unchanged. 
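

The mask expansion can be sketched in Python: each of the 8 FXM bits is replicated four times to cover one 4-bit condition register field, with the most significant FXM bit controlling field 0. The function name and the use of plain integers for CR and RA are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF

def mtcrf(cr, ra_value, fxm):
    """Model mtcrf FXM, RA: return the new condition register value."""
    mask = 0
    for field in range(8):
        if fxm & (0x80 >> field):              # FXM bit for this CR field
            mask |= 0xF << (4 * (7 - field))   # replicate it over 4 bits
    return (ra_value & mask) | (cr & ~mask & MASK32)
```

With FXM = 0xFF the mask is all ones and the whole register is replaced, which is the mtcr extended mnemonic.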


mtctr 
move to count register (CTR) 


0 31 5 | 6 RA 10 | 11 288 20 | 21 467 30 | 31 /


mtctr RA


CTR ← (RA)


The mtctr instruction copies the contents of register RA into the count register. This is an
extended mnemonic of the move to SPR (mtspr) instruction:


mtspr 9, RA


mtfsb0[.]
move to FPSCR bit 0 (reset FPSCR bit)


mtfsb0[.] BT


FPSCR_BT ← 0


The mtfsb0[.] instruction sets FPSCR bit BT (FPSCR_BT) to 0.


If the Rc bit is set (mtfsb0.), then CR1 is updated. Note that bits 1 and 2 (FEX and VX)
cannot be explicitly reset.


mtfsb1[.]
move to FPSCR bit 1 (set FPSCR bit) 


mtfsb1[.] BT


FPSCR_BT ← 1


The mtfsb1[.] instruction sets FPSCR bit BT (FPSCR_BT) to 1.


If the Rc bit is set (mtfsb1.), then CR1 is updated. Note that bits 1 and 2 (FEX and VX)
cannot be explicitly set.

mtfsf[.]

move to FPSCR fields


mtfsf[.] FLM, FRA
MASK ← (⁴FLM₀ || ⁴FLM₁ || ⁴FLM₂ || ... || ⁴FLM₇)


FPSCR ← ((FRA) ∧ MASK) ∨ ((FPSCR) ∧ ¬MASK)


The mtfsf[.] instruction copies the contents of floating-point register FRA into the FPSCR
under control of the 8-bit immediate field FLM. If there is a one bit in FLM, then the
corresponding 4-bit field of the FPSCR is updated from the contents of FRA.


If the Rc bit is set (mtfsf.), then CR1 is updated. Note that if field 0 is specified, bits 1 and 2
retain their meaning on the new bit values (summary of exceptions), rather than being assigned
directly from FRA.


mtfsfi[.] 
move to FPSCR field immediate 


mtfsfi[.] BF, U


FPSCR_BF ← U


The mtfsfi[.] instruction sets FPSCR field BF (FPSCR_BF) to the value in the immediate field U.


If the Rc bit is set (mtfsfi.), then CR1 is updated. Note that if BF=0, bits 1 and 2 retain their
meaning on the new bit values, rather than being set from U.


mtlr


move to link register (LR)


0 31 5 | 6 RA 10 | 11 256 20 | 21 467 30 | 31 /


mtlr RA


LR ← (RA)


The mtlr instruction copies the contents of register RA into the link register. This is an
extended mnemonic of the move to SPR (mtspr) instruction:


mtspr 8, RA


mtmsr 
move to machine state register 


0 31 5 | 6 RA 10 | 11 /// 20 | 21 146 30 | 31 /


mtmsr RA


MSR ← (RA)


The mtmsr instruction copies the contents of register RA into the MSR. 
This instruction is privileged (see Appendix C). 


mtspr 
move to special purpose register 


0 31 5 | 6 RA 10 | 11 SPR 20 | 21 467 30 | 31 /


mtspr SPR, RA


SPR ← (RA)


The mtspr instruction copies the contents of register RA into the special purpose register specified
by the SPR field (see Table B.5).


mtsr
move to segment register 


0 31 5 | 6 RA 10 | 11 / | 12 SR 15 | 16 /// 20 | 21 210 30 | 31 /


mtsr Sx, RA


Sx ← (RA)


The mtsr instruction copies the contents of register RA into segment register Sx.


This instruction is privileged (see Appendix C). This instruction is defined only for 32-bit 
implementations. 


mtsrin 
move to segment register indirect 


mtsrin RA, RT


The mtsrin instruction copies the contents of register RA into the segment register pointed to
by the high-order 4 bits of register RT.


This instruction is privileged (see Appendix C). This instruction is defined only for 32-bit
implementations.


mtxer 
move to integer exception register (XER) 


0 31 5 | 6 RA 10 | 11 32 20 | 21 467 30 | 31 /


mtxer RA


XER ← (RA)


The mtxer instruction copies the contents of register RA into the fixed point exception
register. This is an extended mnemonic of the move to SPR (mtspr) instruction:


mtspr 1, RA


mulhd[.]
multiply integer doubleword, return high doubleword


mulhd[.] RT, RA, RB


RT ← ((RA) × (RB))₀₋₆₃


The mulhd[.] instruction stores the upper-order 64 bits of the product of the contents of RA
and the contents of RB into RT. The contents of RA, RB, and the result are signed doubleword 
integers. 


If the mulhd. form of the instruction is used, then CRO is updated. 


This instruction is defined only for 64-bit implementations. 


mulhdu[.] 
multiply integer doubleword unsigned, return high doubleword 


0 31 5 | 6 RT 10 | 11 RA 15 | 16 RB 20 | 21 / | 22 9 30 | 31 Rc


mulhdu[.] RT, RA, RB


RT ← ((RA) × (RB))₀₋₆₃


The mulhdu[.] instruction stores the upper-order 64 bits of the product of the contents of RA 
and the contents of RB into RT. The contents of RA, RB, and the result are unsigned 
doubleword integers. 


If the mulhdu. form of the instruction is used, then CRO is updated. 
This instruction is defined only for 64-bit implementations. 


mulhw[.]
multiply integer word, return high word


mulhw[.] RT, RA, RB


RT ← ((RA) × (RB))₀₋₃₁


The mulhw[.] instruction stores the upper-order 32 bits of the product of the contents of RA
and the contents of RB into RT. The contents of RA, RB, and the result are signed word
integers.


If the mulhw. form of the instruction is used, then CR0 is updated. Note that if Rc=1 and in
64-bit mode, CR0 bits 0-2 are undefined.


mulhwu[.]
multiply integer word unsigned, return high word


mulhwu[.] RT, RA, RB 


RT ← ((RA) × (RB))₀₋₃₁


The mulhwu[.] instruction stores the upper-order 32 bits of the product of the contents of RA
and the contents of RB into RT. The contents of RA, RB, and the result are unsigned word
integers.


If the mulhwu. form of the instruction is used, then CR0 is updated.
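

The difference between the signed and unsigned forms shows up only in the high word of the product. The Python sketch below models both on 32-bit operands; the function names mirror the mnemonics but are otherwise illustrative.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Reinterpret a 32-bit pattern as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def mulhw(a, b):
    """High 32 bits of the signed 64-bit product."""
    return ((to_signed32(a) * to_signed32(b)) >> 32) & MASK32

def mulhwu(a, b):
    """High 32 bits of the unsigned 64-bit product."""
    return ((a & MASK32) * (b & MASK32)) >> 32
```

With both operands equal to 0xFFFFFFFF, mulhw treats them as -1 and returns 0, while mulhwu returns 0xFFFFFFFE.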


mulld[o][.]
multiply integer doubleword, return low doubleword


0 31 5 | 6 RT 10 | 11 RA 15 | 16 RB 20 | 21 OE | 22 233 30 | 31 Rc


mulld[o][.] RT, RA, RB


RT ← ((RA) × (RB))₆₄₋₁₂₇


The mulld[o][.] instruction stores the lower-order 64 bits of the product of the contents of RA
and the contents of RB into RT. The contents of RA, RB, and their result are double-word
integers.


If the mulldo[.] form of the instruction is used, then the overflow and summary overflow bits 
of the XER are set if overflow occurs. 


If the mulld[o]. form of the instruction is used, then CRO is updated. 


This instruction is defined only for 64-bit implementations. 


mulli 
multiply integer immediate, return low word 


mulli RT, RA, si


RT ← ((RA) × (¹⁶si₁₆ || si₁₆₋₃₁))₃₂₋₆₃


The mulli instruction stores the lower-order 32 bits of the product of the contents of RA and
the sign extended immediate value into RT. On 32-bit implementations, the contents of RA 
and the result are signed word integers. On 64-bit implementations, the contents of RA and 
the result are signed doubleword integers. 


mullw[o][.] 


multiply integer word, return low word 


mullw[o][.] RT, RA, RB 


RT ← ((RA) × (RB))₃₂₋₆₃


The mullw[o][.] instruction stores the lower-order 32 bits of the product of the contents of RA
and the contents of RB into RT. The contents of both RA and RB, and the result are word 
integers. 


If the mullwo[.] form of the instruction is used, then the overflow and summary overflow bits 
of the XER are set if overflow occurs. 


If the mullw[o]. form of the instruction is used, then CRO is updated. 
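

Overflow for the mullwo form means that the signed product does not fit in 32 bits. A minimal Python sketch, assuming 32-bit signed operands and returning the overflow condition as a flag rather than modeling the XER itself (the function name is illustrative):

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Reinterpret a 32-bit pattern as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def mullwo(a, b):
    """Return (low 32 bits of the product, overflow flag)."""
    product = to_signed32(a) * to_signed32(b)
    overflow = not (-(1 << 31) <= product < (1 << 31))
    return product & MASK32, overflow
```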


nand[.]

logical NAND

nand[.] RT, RA, RB


RT ← ¬((RA) ∧ (RB))


The nand[.] instruction performs a logical NAND of the contents of register RA with the
contents of register RB. The result of this is then stored into register RT.


If the Rc bit is set (nand.), then CRO is updated. 


neg[o][.]


integer negate
neg[o][.] RT, RA


RT ← −(RA)


The neg[o][.] instruction stores the arithmetic negation of the contents of RA into RT.


If the nego[.] form of the instruction is used, then the overflow and summary overflow bits of 
the XER are set if overflow occurs. 


If the neg[o]. form of the instruction is used, then CR0 is updated.
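

In two's-complement arithmetic there is exactly one value whose negation cannot be represented: the most negative integer. The Python sketch below shows this for 32-bit registers; the function name and the returned flag are illustrative stand-ins for the XER overflow bits.

```python
MASK32 = 0xFFFFFFFF

def nego(a):
    """Return (negated 32-bit value, overflow flag)."""
    a &= MASK32
    result = (-a) & MASK32
    # Only the most negative value negates to itself, which is overflow.
    overflow = a == 0x80000000
    return result, overflow
```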


nop 
no operation 


[nothing] 


The nop instruction does nothing. This is an extended mnemonic for the logical OR (or[.]) 
instruction: 


or 0, 0, 0 


nor[.] 


logical NOR 


nor[.] RT, RA, RB
RT ← ¬((RA) ∨ (RB))


The nor[.] instruction performs a logical NOR of the contents of register RA with the
contents of register RB. The result of this is then stored into register RT.


If the Rc bit is set (nor.), then CRO is updated. 


not[.]
logical NOT


0 31 5 | 6 RA 10 | 11 RT 15 | 16 RA 20 | 21 124 30 | 31 Rc


not[.] RT, RA


RT ← ¬(RA)


The not[.] instruction performs a logical negation of the contents of register RA. The result of
this is then stored into register RT. This is an extended mnemonic for the logical NOR (nor[.])
instruction:


nor[.] RT, RA, RA


If the Rc bit is set (not.), then CR0 is updated.
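

Several extended mnemonics in this appendix work by feeding the same register to both inputs of a two-operand logical instruction. A small Python sketch, assuming 32-bit registers, shows why this yields NOT and a plain register copy (the trailing underscores only avoid Python keywords):

```python
MASK32 = 0xFFFFFFFF

def nor(a, b):
    """Bitwise NOR of two 32-bit values."""
    return ~(a | b) & MASK32

def or_(a, b):
    """Bitwise OR of two 32-bit values."""
    return (a | b) & MASK32

def not_(a):
    """not RT, RA is nor RT, RA, RA."""
    return nor(a, a)

def mr(a):
    """mr RT, RA is or RT, RA, RA."""
    return or_(a, a)
```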


or[.] 


logical OR 
or[.] RT, RA, RB


RT ← (RA) ∨ (RB)


The or[.] instruction performs a logical OR of the contents of register RA with the contents of 
register RB. The result of this is then stored into register RT. 


If the Rc bit is set (or.), then CRO is updated. 


orc[.] 
logical OR with complement


0 31 5 | 6 RA 10 | 11 RT 15 | 16 RB 20 | 21 412 30 | 31 Rc
orc[.] RT, RA, RB


RT ← (RA) ∨ ¬(RB)


The orc[.] instruction performs a logical OR of the contents of register RA with the logical 
negation of the contents of register RB. The result of this is then stored into register RT. 


If the Rc bit is set (orc.), then CRO is updated. 


ori 

logical OR immediate 
ori RT, RA, ui


RT ← (RA) ∨ (¹⁶0 || ui)


The ori instruction performs a logical OR of the low-order 16 bits of the contents of register
RA with the immediate field ui; the high-order 16 bits of RA are unchanged. The result of this 
is then stored into register RT. 


oris 


logical OR shifted immediate 


oris RT, RA, ui


RT ← (RA) ∨ (ui || ¹⁶0)


The oris instruction performs a logical OR of the high-order 16 bits of the contents of register
RA with the immediate field ui; the low-order 16 bits of RA are unchanged. The result
of this is then stored into register RT.
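

The two immediate forms differ only in where the 16-bit value is placed before the OR. A Python sketch for 32-bit registers (function names mirror the mnemonics and are illustrative):

```python
MASK32 = 0xFFFFFFFF

def ori(ra_value, ui):
    """OR the immediate into the low-order 16 bits."""
    return (ra_value | (ui & 0xFFFF)) & MASK32

def oris(ra_value, ui):
    """OR the immediate into the high-order 16 bits."""
    return (ra_value | ((ui & 0xFFFF) << 16)) & MASK32
```

Because the immediate is zero-extended rather than sign-extended, neither form disturbs the other half of the register.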


rfi 


return from interrupt 




rfi 


This instruction is used to return control to a user application from an interrupt handler. The 
MSR is updated from SRR1, and control is passed to the instruction at the address contained 
in SRRO (see Appendix C). 


This instruction is privileged and context synchronizing. 


rldcl[.]
rotate left doubleword, then clear left


0 30 5 | 6 RA 10 | 11 RT 15 | 16 RB 20 | 21 mb 26 | 27 8 30 | 31 Rc


rldcl[.] RT, RA, RB, MB


MB = mb₂₆ || mb₂₁₋₂₅


RT ← ROT((RA), (RB)) ∧ MASK((mb₂₆ || mb₂₁₋₂₅), 63)


The rldcl[.] instruction rotates the contents of register RA left by the number of bits specified 
in the low-order 6 bits of register RB, then performs a logical AND of this with a mask
consisting of MB zeros followed by 64-MB ones. This result is then stored into RT.


If the Rc bit is set (rldcl.), then CRO is updated. 


This instruction is defined only for 64-bit implementations.


rldcr[.]
rotate left doubleword then clear right


0 30 5 | 6 RA 10 | 11 RT 15 | 16 RB 20 | 21 me 26 | 27 9 30 | 31 Rc


rldcr[.] RT, RA, RB, ME


ME = me₂₆ || me₂₁₋₂₅


RT ← ROT((RA), (RB)) ∧ MASK(0, (me₂₆ || me₂₁₋₂₅))


The rldcr[.] instruction rotates the contents of register RA left by the number of bits specified 
in the low-order 6 bits of register RB, then performs a logical AND of this with a mask consist- 
ing of ME+1 ones followed by 63-ME zeros. This result is then stored into RT. 


If the Re bit is set (rldicr.), then CRO is updated. 


This instruction is defined only for 64-bit implementations. 
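
The two register-controlled doubleword rotates can be sketched in Python as follows. This is a model for illustration only: bit 0 is taken to be the most significant bit, as in the descriptions above, and the helper names rotl64 and mask64 are invented for this sketch.

```python
M64 = (1 << 64) - 1  # all ones in a 64-bit register


def rotl64(x, n):
    """Rotate a 64-bit value left by n bits (n is reduced modulo 64)."""
    n &= 63
    return ((x << n) | (x >> (64 - n))) & M64


def mask64(mb, me):
    """Ones from bit mb through bit me (bit 0 = MSB), wrapping if mb > me."""
    from_mb = M64 >> mb                  # ones in bits mb..63
    up_to_me = (M64 << (63 - me)) & M64  # ones in bits 0..me
    return (from_mb & up_to_me) if mb <= me else (from_mb | up_to_me)


def rldcl(ra, rb, mb):
    """Rotate left by the low-order 6 bits of RB, then clear the MB leftmost bits."""
    return rotl64(ra, rb & 0x3F) & mask64(mb, 63)


def rldcr(ra, rb, me):
    """Rotate left by the low-order 6 bits of RB, then clear the bits to the right of ME."""
    return rotl64(ra, rb & 0x3F) & mask64(0, me)
```

With MB = 0 the mask is all ones, so rldcl reduces to a plain rotate.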


rldic[.]
rotate left doubleword immediate then clear

rldic[.] RT, RA, sh, MB

MB = mb[26] || mb[21:25]

RT ← ROT((RA), sh) ∧ MASK(MB, 63-sh)

The rldic[.] instruction rotates the contents of register RA left by sh bits, then performs a
logical AND of this with a 64-bit mask consisting of ones starting at bit MB and continuing
(possibly wrapping) until bit 63-sh. This result is then stored into RT.

If the Rc bit is set (rldic.), then CR0 is updated.

This instruction is defined only for 64-bit implementations.


rldicl[.]
rotate left doubleword immediate then clear left

rldicl[.] RT, RA, sh, MB

MB = mb[26] || mb[21:25]

RT ← ROT((RA), sh) ∧ MASK(MB, 63)

The rldicl[.] instruction rotates the contents of register RA left by sh bits, then performs a
logical AND of this with a mask consisting of MB zeros followed by 64-MB ones. This result is
then stored into RT.

If the Rc bit is set (rldicl.), then CR0 is updated.


This instruction is defined only for 64-bit implementations. 


rldicr[.]
rotate left doubleword immediate then clear right

rldicr[.] RT, RA, sh, ME

ME = me[26] || me[21:25]

RT ← ROT((RA), sh) ∧ MASK(0, ME)

The rldicr[.] instruction rotates the contents of register RA left by sh bits, then performs a
logical AND of this with a mask consisting of ME+1 ones followed by 63-ME zeros. This result is
then stored into RT.

If the Rc bit is set (rldicr.), then CR0 is updated.


This instruction is defined only for 64-bit implementations. 


rldimi[.]
rotate left doubleword immediate then insert mask

rldimi[.] RT, RA, sh, MB

MB = mb[26] || mb[21:25]

RT ← (ROT((RA), sh) ∧ MASK(MB, 63-sh)) ∨ ((RT) ∧ ¬MASK(MB, 63-sh))

The rldimi[.] instruction rotates the contents of register RA left by sh bits, then inserts this
into register RT under control of a 64-bit mask consisting of ones starting at bit MB and
continuing (possibly wrapping) until bit 63-sh. Wherever there is a 1 in the mask, the bit is
taken from the rotated register RA, and wherever there is a 0, the bit is taken from the contents
of RT. The result is stored into RT.

If the Rc bit is set (rldimi.), then CR0 is updated.


This instruction is defined only for 64-bit implementations. 
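
The immediate doubleword rotate-and-mask forms differ only in which mask they apply. The following Python sketch models all four; it is an illustration under the bit-numbering convention used above (bit 0 is the most significant bit), and the helper names are invented here rather than taken from any toolchain.

```python
M64 = (1 << 64) - 1


def rotl64(x, n):
    """Rotate a 64-bit value left by n bits."""
    n &= 63
    return ((x << n) | (x >> (64 - n))) & M64


def mask64(mb, me):
    """Ones from bit mb through bit me (bit 0 = MSB), wrapping if mb > me."""
    from_mb = M64 >> mb
    up_to_me = (M64 << (63 - me)) & M64
    return (from_mb & up_to_me) if mb <= me else (from_mb | up_to_me)


def rldicl(ra, sh, mb):
    return rotl64(ra, sh) & mask64(mb, 63)


def rldicr(ra, sh, me):
    return rotl64(ra, sh) & mask64(0, me)


def rldic(ra, sh, mb):
    return rotl64(ra, sh) & mask64(mb, 63 - sh)


def rldimi(rt, ra, sh, mb):
    """Insert the rotated RA into RT wherever the mask holds a 1."""
    m = mask64(mb, 63 - sh)
    return (rotl64(ra, sh) & m) | (rt & ~m & M64)
```

For example, rldimi(rt, ra, 8, 48) replaces only the second-lowest byte of RT with the low byte of RA and leaves the other seven bytes untouched.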
rlwimi[.]
rotate left word immediate then insert mask

rlwimi[.] RT, RA, sh, mb, me

RT ← (ROT((RA), sh) ∧ MASK(mb, me)) ∨ ((RT) ∧ ¬MASK(mb, me))

The rlwimi[.] instruction rotates the contents of register RA left by sh bits, then inserts this
into register RT under control of a mask consisting of ones starting at bit mb and continuing
(possibly wrapping) until bit me. Wherever there is a 1 in the mask, the bit is taken from the
rotated register RA, and wherever there is a 0, the bit is taken from the contents of RT. The
result is stored into RT.

If the Rc bit is set (rlwimi.), then CR0 is updated.


rlwinm[.]
rotate left word immediate then AND with mask

rlwinm[.] RT, RA, sh, mb, me

RT ← ROT((RA), sh) ∧ MASK(mb, me)

The rlwinm[.] instruction rotates the contents of register RA left by sh bits, then performs a
logical bitwise AND of this with a mask consisting of ones starting at bit mb and continuing
(possibly wrapping) until bit me. The result is stored into RT.

If the Rc bit is set (rlwinm.), then CR0 is updated.


rlwnm[.]
rotate left word then AND with mask

rlwnm[.] RT, RA, RB, mb, me

RT ← ROT((RA), (RB)) ∧ MASK(mb, me)

The rlwnm[.] instruction rotates the contents of register RA left by the amount specified in the
low-order 5 bits of the contents of RB, then performs a logical bitwise AND of this with a mask
consisting of ones starting at bit mb and continuing (possibly wrapping) until bit me. The
result is stored into RT.

If the Rc bit is set (rlwnm.), then CR0 is updated.
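
The word rotate-and-mask instructions are easier to follow with a small model of MASK(mb, me). The Python sketch below assumes 32-bit registers with bit 0 as the most significant bit; rotl32 and mask32 are names invented for this illustration.

```python
M32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit value left by n bits."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & M32


def mask32(mb, me):
    """Ones from bit mb through bit me (bit 0 = MSB), wrapping if mb > me."""
    from_mb = M32 >> mb                  # ones in bits mb..31
    up_to_me = (M32 << (31 - me)) & M32  # ones in bits 0..me
    return (from_mb & up_to_me) if mb <= me else (from_mb | up_to_me)


def rlwinm(ra, sh, mb, me):
    return rotl32(ra, sh) & mask32(mb, me)


def rlwnm(ra, rb, mb, me):
    return rotl32(ra, rb & 0x1F) & mask32(mb, me)


def rlwimi(rt, ra, sh, mb, me):
    m = mask32(mb, me)
    return (rotl32(ra, sh) & m) | (rt & ~m & M32)
```

As a usage example, rlwinm(x, 8, 24, 31) extracts the most significant byte of x into the low-order byte of the result, and mask32(28, 3) shows the wrapping case by producing ones in the top four and bottom four bit positions.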


rotld[.]
rotate left doubleword

rotld[.] RT, RA, RB

RT ← ROT((RA), (RB))

The rotld[.] instruction rotates the contents of register RA left by the number of bits specified
in the low-order six bits of register RB, then stores this result into RT. This is an extended
mnemonic for the rotate left doubleword, then clear left (rldcl[.]) instruction:

rldcl[.] RT, RA, RB, 0

If the Rc bit is set (rotld.), then CR0 is updated.





This instruction is defined only for 64-bit implementations. 


rotldi[.]
rotate left doubleword immediate

rotldi[.] RT, RA, n

RT ← ROT((RA), n)

The rotldi[.] instruction rotates the contents of register RA left by n bits, then stores this
result into RT. This is an extended mnemonic for the rotate left doubleword immediate then
clear left (rldicl[.]) instruction:

rldicl[.] RT, RA, n, 0

If the Rc bit is set (rotldi.), then CR0 is updated.


This instruction is defined only for 64-bit implementations. 


rotlw[.]
rotate left word

rotlw[.] RT, RA, RB

RT ← ROT((RA), (RB))

The rotlw[.] instruction rotates the contents of register RA left by the amount specified in the
low-order 5 bits of the contents of RB, then stores this result into RT. This is an extended
mnemonic for the rotate left word then AND with mask (rlwnm[.]) instruction:

rlwnm[.] RT, RA, RB, 0, 31

If the Rc bit is set (rotlw.), then CR0 is updated.


rotlwi[.]
rotate left word immediate

rotlwi[.] RT, RA, n

RT ← ROT((RA), n)

The rotlwi[.] instruction rotates the contents of register RA left by n bits, then stores this
result into RT. This is an extended mnemonic for the rotate left word immediate then AND with
mask (rlwinm[.]) instruction:

rlwinm[.] RT, RA, n, 0, 31

If the Rc bit is set (rotlwi.), then CR0 is updated.

rotrdi[.]
rotate right doubleword immediate

rotrdi[.] RT, RA, n

RT ← ROT((RA), 64-n)

The rotrdi[.] instruction rotates the contents of register RA right by n bits, then stores this
result into RT. This is an extended mnemonic for the rotate left doubleword immediate then
clear left (rldicl[.]) instruction:

rldicl[.] RT, RA, 64-n, 0

If the Rc bit is set (rotrdi.), then CR0 is updated.


This instruction is defined only for 64-bit implementations. 
rotrwi[.]
rotate right word immediate

rotrwi[.] RT, RA, n

RT ← ROT((RA), 32-n)

The rotrwi[.] instruction rotates the contents of register RA right by n bits, then stores this
result into RT. This is an extended mnemonic for the rotate left word immediate then AND
with mask (rlwinm[.]) instruction:

rlwinm[.] RT, RA, 32-n, 0, 31

If the Rc bit is set (rotrwi.), then CR0 is updated.


sc 


system call 





The sc instruction is used to call the system to perform a service. When this instruction is
executed, the system call interrupt handler is initiated. The contents of registers after this
instruction (when control has been passed back to the program) depend on the register
conventions of the system call interface.


slbia 

SLB invalidate all 
slbia 


The slbia instruction invalidates all SLB entries in the processor (see Appendix C). 
This instruction is privileged. 
This instruction is available only on 64-bit implementations. 


This instruction is optional. Check the Users’ Guide for the implementation you are interested in. 
slbie
SLB invalidate entry 


slbie RA 





The slbie instruction invalidates an SLB entry (see Appendix C). The effective address of the 
instruction is the contents of RA, and if there is an SLB entry associated with that address in 
the processor, then that entry is invalidated. 


This instruction is privileged. 
This instruction is available only on 64-bit implementations. 


This instruction is optional. Check the Users’ Guide for the implementation you are interested in. 


sld[.]
shift left doubleword

sld[.] RT, RA, RB

RT ← (RA) << (RB)[57:63]

The sld[.] instruction shifts the contents of register RA left by the amount specified in the
low-order 7 bits of the contents of RB. This result is stored in RT. Bits shifted out of bit 0 are
lost, and zeros are supplied into vacated bit positions.

If the Rc bit is set (sld.), then CR0 is updated.
This instruction is available only on 64-bit implementations.
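
The shift amount is one bit wider than is needed to address every bit of the register, so amounts of 64 through 127 are legal and simply shift every bit out. A minimal Python sketch of this behavior follows, together with the 32-bit counterpart slw, which uses the low-order 6 bits in the same way; the function names mirror the mnemonics and are otherwise arbitrary.

```python
M64 = (1 << 64) - 1
M32 = (1 << 32) - 1


def sld(ra, rb):
    """Shift left doubleword: the amount is the low-order 7 bits of RB."""
    n = rb & 0x7F
    return (ra << n) & M64 if n < 64 else 0


def slw(ra, rb):
    """Shift left word: the amount is the low-order 6 bits of RB."""
    n = rb & 0x3F
    return (ra << n) & M32 if n < 32 else 0
```

Note that only the low-order bits of RB take part: sld(1, 128) leaves the value unchanged because the low-order 7 bits of 128 are zero.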


sldi[.]
shift left doubleword immediate

sldi[.] RT, RA, n (n < 64)

RT ← (RA) << n

The sldi[.] instruction shifts the contents of register RA left by n bits. This result is stored
in RT. Bits shifted out of bit 0 are lost, and zeros are supplied into vacated bit positions. This
is an extended mnemonic for the rotate left doubleword immediate then clear right (rldicr[.])
instruction:

rldicr[.] RT, RA, n, 63-n

If the Rc bit is set (sldi.), then CR0 is updated.


This instruction is available only on 64-bit implementations. 


slw[.]
shift left word

slw[.] RT, RA, RB

RT ← (RA) << (RB)[26:31]

The slw[.] instruction shifts the contents of register RA left by the amount specified in the
low-order 6 bits of the contents of RB. This result is stored in RT. Bits shifted out of bit 0 are
lost, and zeros are supplied into vacated bit positions.

If the Rc bit is set (slw.), then CR0 is updated.


slwi[.]
shift left word immediate

slwi[.] RT, RA, n (n < 32)

RT ← (RA) << n

The slwi[.] instruction shifts the contents of register RA left by n bits. This result is stored
in RT. Bits shifted out of bit 0 are lost, and zeros are supplied into vacated bit positions. This
is an extended mnemonic for the rotate left word immediate then AND with mask (rlwinm[.])
instruction:

rlwinm[.] RT, RA, n, 0, 31-n

If the Rc bit is set (slwi.), then CR0 is updated.
srad[.]
shift right algebraic doubleword

srad[.] RT, RA, RB

RT ← (RA) >> (RB)[57:63]

The srad[.] instruction shifts the contents of register RA right by the amount specified in the
low-order 7 bits of the contents of RB. This result is stored in RT. If any 1 bits are shifted
out of bit position 63 and RA is negative, then the XER[CA] bit is set. The sign bit of RA (bit 0)
is supplied into vacated bit positions.

If the Rc bit is set (srad.), then CR0 is updated.
This instruction is available only on 64-bit implementations.


sradi[.]
shift right algebraic doubleword immediate

sradi[.] RT, RA, n

RT ← (RA) >> n

The sradi[.] instruction shifts the contents of register RA right by n bits. This result is stored
in RT. If any 1 bits are shifted out of bit position 63 and RA is negative, then the XER[CA] bit
is set. The sign bit of RA (bit 0) is supplied into vacated bit positions.

If the Rc bit is set (sradi.), then CR0 is updated.
This instruction is available only on 64-bit implementations.


sraw[.]
shift right algebraic word

sraw[.] RT, RA, RB

RT ← (RA) >> (RB)[26:31]

The sraw[.] instruction shifts the contents of register RA right by the amount specified in the
low-order 6 bits of the contents of RB. This result is stored in RT. If any 1 bits are shifted
out of bit position 31 and RA is negative, then the XER[CA] bit is set to 1. Copies of the sign
bit of RA (RA bit 0) are supplied into vacated bit positions.

If the Rc bit is set (sraw.), then CR0 is updated.


srawi[.]
shift right algebraic word immediate

srawi[.] RT, RA, n

RT ← (RA) >> n

The srawi[.] instruction shifts the contents of register RA right by n bits. This result is stored
in RT. If any 1 bits are shifted out of bit position 31 and RA is negative, then the XER[CA]
bit is set to 1. Copies of the sign bit of RA (RA bit 0) are supplied into vacated bit positions.

If the Rc bit is set (srawi.), then CR0 is updated.
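
The carry rule for the algebraic shifts can be seen in a short Python model. This is a sketch assuming 32-bit registers; the function name sra32 and the returned (result, carry) pair are conventions of this illustration only.

```python
M32 = 0xFFFFFFFF


def sra32(ra, n):
    """Model of sraw/srawi: return (result, CA) for a shift of n bits (0..63)."""
    ra &= M32
    signed = ra - (1 << 32) if ra & 0x80000000 else ra
    result = (signed >> n) & M32            # Python's >> replicates the sign bit
    shifted_out = ra & ((1 << n) - 1)       # the bits that fell off the right end
    ca = 1 if (signed < 0 and shifted_out != 0) else 0
    return result, ca


def divide_by_power_of_two(ra, n):
    """Signed division by 2**n, rounding toward zero, as result + CA."""
    result, ca = sra32(ra, n)
    return (result + ca) & M32
```

The second function shows why the carry is defined this way: adding CA to the shifted result corrects the rounding for negative operands, so that -7 divided by 2 gives -3 rather than -4.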


srd[.]
shift right doubleword

srd[.] RT, RA, RB

RT ← (RA) >> (RB)[57:63]

The srd[.] instruction shifts the contents of register RA right by the amount specified in the
low-order 7 bits of the contents of RB. This result is stored in RT. Bits shifted out of bit
position 63 are lost, and zeros are supplied into vacated bit positions.

If the Rc bit is set (srd.), then CR0 is updated.


This instruction is available only on 64-bit implementations. 


srdi[.]
shift right doubleword immediate

srdi[.] RT, RA, n

RT ← (RA) >> n

The srdi[.] instruction shifts the contents of register RA right by n bits. This result is stored
in RT. Bits shifted out of bit position 63 are lost, and zeros are supplied into vacated bit
positions. This is an extended mnemonic for the rotate left doubleword immediate then clear left
(rldicl[.]) instruction:

rldicl[.] RT, RA, 64-n, n

If the Rc bit is set (srdi.), then CR0 is updated.


This instruction is available only on 64-bit implementations. 


srw[.]
shift right word

srw[.] RT, RA, RB

RT ← (RA) >> (RB)[26:31]

The srw[.] instruction shifts the contents of register RA right by the amount specified in the
low-order 6 bits of the contents of RB. This result is stored in RT. Bits that are shifted out of
bit position 31 are lost, and zeros are supplied into vacated bit positions.

If the Rc bit is set (srw.), then CR0 is updated.


srwi[.]
shift right word immediate

srwi[.] RT, RA, n

RT ← (RA) >> n

The srwi[.] instruction shifts the contents of register RA right by n bits. This result is stored
in RT. Bits shifted out of bit position 31 are lost, and zeros are supplied into vacated bit
positions. This is an extended mnemonic for the rotate left word immediate then AND with mask
(rlwinm[.]) instruction:

rlwinm[.] RT, RA, 32-n, n, 31

If the Rc bit is set (srwi.), then CR0 is updated.
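
The word shift and rotate extended mnemonics all expand to rlwinm with different operands. The Python sketch below checks those expansions against ordinary shifts; it assumes 32-bit registers with bit 0 as the most significant bit and shift counts from 1 through 31, and the helper names are invented for the illustration.

```python
M32 = 0xFFFFFFFF


def rotl32(x, n):
    n &= 31
    return ((x << n) | (x >> (32 - n))) & M32


def mask32(mb, me):
    """Ones from bit mb through bit me (bit 0 = MSB), wrapping if mb > me."""
    from_mb = M32 >> mb
    up_to_me = (M32 << (31 - me)) & M32
    return (from_mb & up_to_me) if mb <= me else (from_mb | up_to_me)


def rlwinm(ra, sh, mb, me):
    return rotl32(ra, sh) & mask32(mb, me)


# Expansions of the extended mnemonics, as given in the entries above.
def slwi(ra, n):
    return rlwinm(ra, n, 0, 31 - n)


def srwi(ra, n):
    return rlwinm(ra, 32 - n, n, 31)


def rotlwi(ra, n):
    return rlwinm(ra, n, 0, 31)


def rotrwi(ra, n):
    return rlwinm(ra, 32 - n, 0, 31)
```

In each case the rotate moves the bits to their final positions and the mask clears the bits that a true shift would have discarded.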
stb
store byte immediate

stb RS, D(RA)

MEM[((RA) + (¹⁶D[16] || D[16:31])), 1] ← (RS)[24:31]

The stb instruction stores the least significant byte contained in RS at the effective address of
the instruction. The effective address of the instruction is found by adding the contents of RA
to the sign extended 16-bit immediate field, unless RA is 0 (r0), in which case the effective
address is the sign extended immediate field.


stbu
store byte immediate with update

stbu RS, D(RA)

MEM[((RA) + (¹⁶D[16] || D[16:31])), 1] ← (RS)[24:31]

RA ← (RA) + (¹⁶D[16] || D[16:31])

The stbu instruction stores the least significant byte contained in RS at the effective address of
the instruction. The effective address of the instruction is found by adding the contents of RA
to the sign extended 16-bit immediate field. If RA=0 the instruction form is invalid.


In addition, the effective address of the instruction is loaded into RA. 
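
The effective address calculation shared by the D-form stores, and the extra step performed by the update forms, can be sketched in Python. This is a simplified model: memory is a dictionary of bytes, gpr is a list of 32 register values, addresses are assumed to be 32 bits wide, and all function names are invented for the illustration.

```python
def sign_extend16(d):
    """Interpret the 16-bit immediate field as a signed displacement."""
    d &= 0xFFFF
    return d - 0x10000 if d & 0x8000 else d


def effective_address(gpr, ra, d):
    """(RA) + sign-extended D, with RA = 0 meaning 'no base register'."""
    base = 0 if ra == 0 else gpr[ra]
    return (base + sign_extend16(d)) & 0xFFFFFFFF


def stb(mem, gpr, rs, ra, d):
    mem[effective_address(gpr, ra, d)] = gpr[rs] & 0xFF


def stbu(mem, gpr, rs, ra, d):
    if ra == 0:
        raise ValueError("stbu with RA=0 is an invalid instruction form")
    ea = effective_address(gpr, ra, d)
    mem[ea] = gpr[rs] & 0xFF
    gpr[ra] = ea  # the update: RA receives the effective address
```

Because the update form leaves the effective address in RA, a sequence of stbu instructions with a displacement of 1 walks through consecutive bytes without a separate add.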


stbux
store byte with update

stbux RS, RA, RB

MEM[((RA) + (RB)), 1] ← (RS)[24:31]

RA ← (RA) + (RB)

The stbux instruction stores the least significant byte contained in RS at the effective address
of the instruction. The effective address of the instruction is found by adding the contents of
RA to the contents of RB. If RA=0 the instruction form is invalid.


In addition, the effective address of the instruction is loaded into RA. 


stbx
store byte

stbx RS, RA, RB

MEM[((RA) + (RB)), 1] ← (RS)[24:31]

The stbx instruction stores the least significant byte contained in RS at the effective address of
the instruction. The effective address of the instruction is found by adding the contents of RA
to the contents of RB, unless RA is 0 (r0), in which case the effective address is RB.


std
store doubleword immediate

std RS, DS(RA)

MEM[((RA) + (⁴⁸DS[16] || DS[16:29] || ²0)), 8] ← (RS)

The std instruction stores the doubleword contained in register RS at the effective address of
the instruction. The effective address of the instruction is found by adding the contents of RA
to the sign extended 14-bit immediate field with two binary zeros concatenated to the right,
unless RA is 0 (r0), in which case the effective address is just the sign extended immediate field
with two binary zeros concatenated to the right.


This instruction is available only on 64-bit implementations. 
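
Because the DS field holds only 14 bits and two zero bits are appended to it, the displacement of a doubleword store is always a multiple of 4. The following Python sketch decodes such a field; the function name is invented for the illustration.

```python
def ds_displacement(ds):
    """Sign-extend the 14-bit DS field and append two zero bits."""
    ds &= 0x3FFF
    signed = ds - 0x4000 if ds & 0x2000 else ds
    return signed * 4  # appending two zero bits multiplies by 4
```

The reachable range is therefore the same as for a 16-bit D field, but only every fourth byte offset can be encoded.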


stdcx.
conditional store doubleword

stdcx. RS, RA, RB

MEM[((RA) + (RB)), 8] ← (RS)

The stdcx. instruction stores the doubleword contained in register RS at the effective address
of the instruction if and only if a reservation exists in the processor, and the address of the
reservation corresponds to the address of this instruction. The effective address of the
instruction is found by adding the contents of RA to the contents of RB, unless RA is 0 (r0), in
which case the effective address is just the contents of RB. If the store is successful (that is,
the correct reservation exists), then the reservation is cleared, CR0[0:1] is set to 0, and the EQ
bit of CR0 is set to 1. If the store is unsuccessful, then the EQ bit of CR0 is set to 0 and
CR0[0:1] is set to 0. The effective address of this instruction must be a multiple of 8
(doubleword aligned).


If the reservation exists, but the address of the reservation does not correspond to the address 
of the instruction, it is undefined whether (RS) is stored into the doubleword in storage ad- 
dressed by EA. 


This instruction is defined only for 64-bit implementations. 


stdu
store doubleword immediate with update

stdu RS, DS(RA)

MEM[((RA) + (⁴⁸DS[16] || DS[16:29] || ²0)), 8] ← (RS)

RA ← (RA) + (⁴⁸DS[16] || DS[16:29] || ²0)

The stdu instruction stores the doubleword contained in register RS at the effective address of
the instruction. The effective address of the instruction is found by adding the contents of RA
to the sign extended 14-bit immediate field with two binary zeros concatenated to the right.


In addition, the effective address of the instruction is loaded into RA. 
If RA=0 the instruction form is invalid. 
This instruction is available only on 64-bit implementations. 


stdux
store doubleword with update

stdux RS, RA, RB

MEM[((RA) + (RB)), 8] ← (RS)

RA ← (RA) + (RB)

The stdux instruction stores the doubleword contained in register RS at the effective address of
the instruction. The effective address of the instruction is found by adding the contents of RA
to the contents of RB. If RA=0 the instruction form is invalid.


In addition, the effective address of the instruction is loaded into RA. 


This instruction is available only on 64-bit implementations. 


stdx 


store doubleword 


stdx RS, RA, RB

MEM[((RA) + (RB)), 8] ← (RS)

The stdx instruction stores the doubleword contained in register RS at the effective address of
the instruction. The effective address of the instruction is found by adding the contents of RA
to the contents of RB, unless RA is 0 (r0), in which case the effective address is just the
contents of RB.

This instruction is available only on 64-bit implementations.


stfd
store double precision floating-point immediate

[Encoding diagram: opcode 54 in bits 0-5, FRS in bits 6-10, RA in bits 11-15, D in bits 16-31]

stfd FRS, D(RA)

MEM[((RA) + (¹⁶D[16] || D[16:31])), 8] ← (FRS)

The stfd instruction stores the double precision floating-point value contained in floating-point
register FRS at the effective address of the instruction. The effective address of the instruction
is found by adding the contents of RA to the sign extended 16-bit immediate field, unless RA
is 0 (r0), in which case the effective address is just the sign extended immediate field.
stfdu
store double precision floating-point immediate with update

stfdu FRS, D(RA)

MEM[((RA) + (¹⁶D[16] || D[16:31])), 8] ← (FRS)

RA ← (RA) + (¹⁶D[16] || D[16:31])

The stfdu instruction stores the double precision floating-point value contained in
floating-point register FRS at the effective address of the instruction. The effective address of
the instruction is found by adding the contents of RA to the sign extended 16-bit immediate
field. If RA=0 the instruction form is invalid.


In addition, the effective address is loaded into register RA. 


stfdux
store double precision floating-point with update

stfdux FRS, RA, RB

MEM[((RA) + (RB)), 8] ← (FRS)

RA ← (RA) + (RB)

The stfdux instruction stores the double precision floating-point value contained in
floating-point register FRS at the effective address of the instruction. The effective address of
the instruction is found by adding the contents of RA to the contents of RB. If RA=0 the
instruction form is invalid.


In addition, the effective address is loaded into register RA. 


stfdx
store double precision floating-point

stfdx FRS, RA, RB

MEM[((RA) + (RB)), 8] ← (FRS)

The stfdx instruction stores the double precision floating-point value contained in
floating-point register FRS at the effective address of the instruction. The effective address of
the instruction is found by adding the contents of RA to the contents of RB, unless RA is 0 (r0),
in which case the effective address is just the contents of RB.


stfiwx
store floating-point as integer word

stfiwx FRS, RA, RB

MEM[((RA) + (RB)), 4] ← (FRS)[32:63]

The stfiwx instruction stores the low-order 32 bits of the contents of floating-point register
FRS at the effective address of the instruction without any conversion. The effective address of
the instruction is found by adding the contents of RA to the contents of RB, unless RA is 0
(r0), in which case the effective address is just the contents of RB. This instruction is optional.


stfs
store single precision floating-point immediate

stfs FRS, D(RA)

MEM[((RA) + (¹⁶D[16] || D[16:31])), 4] ← Convert_to_Single((FRS))

The stfs instruction converts the contents of floating-point register FRS to single precision and
then stores this single precision floating-point value at the effective address of the instruction.
The effective address of the instruction is found by adding the contents of RA to the sign
extended 16-bit immediate field, unless RA is 0 (r0), in which case the effective address is just
the sign extended immediate field.
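
The difference between storing a double, storing a converted single, and storing the raw low-order word (as stfiwx does) can be illustrated by looking at the bytes each would place in memory. The Python sketch below uses the standard struct module and assumes big-endian byte order. Note that struct rounds to the nearest single precision value, which may not match the hardware conversion for values that cannot be represented exactly in single precision, so only exactly representable values are used here. The function names are invented for the illustration.

```python
import struct


def stfd_bytes(value):
    """Eight bytes of the double precision image, big-endian."""
    return struct.pack(">d", value)


def stfs_bytes(value):
    """Four bytes after conversion to single precision, big-endian."""
    return struct.pack(">f", value)


def stfiwx_bytes(frs_bits):
    """Low-order 32 bits of the 64-bit register image, stored unconverted."""
    return struct.pack(">I", frs_bits & 0xFFFFFFFF)
```

For the value 1.0, the double image is 3FF0000000000000 and the single image is 3F800000, while stfiwx applied to the same register would store 00000000, the untouched low-order word.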


stfsu
store single precision floating-point immediate with update

stfsu FRS, D(RA)

MEM[((RA) + (¹⁶D[16] || D[16:31])), 4] ← Convert_to_Single((FRS))

RA ← (RA) + (¹⁶D[16] || D[16:31])

The stfsu instruction converts the contents of floating-point register FRS to single precision
and then stores this single precision floating-point value at the effective address of the
instruction. The effective address of the instruction is found by adding the contents of RA to the
sign extended 16-bit immediate field. If RA=0 then the instruction form is invalid.


In addition, the effective address is loaded into register RA. 


stfsux
store single precision floating-point with update

stfsux FRS, RA, RB

MEM[((RA) + (RB)), 4] ← Convert_to_Single((FRS))

RA ← (RA) + (RB)

The stfsux instruction converts the contents of floating-point register FRS to single precision
and then stores this single precision floating-point value at the effective address of the
instruction. The effective address of the instruction is found by adding the contents of RA to the
contents of RB. If RA=0 then the instruction form is invalid.


In addition, the effective address is loaded into register RA. 


stfsx
store single precision floating-point

stfsx FRS, RA, RB

MEM[((RA) + (RB)), 4] ← Convert_to_Single((FRS))

The stfsx instruction converts the contents of floating-point register FRS to single precision
and then stores this single precision floating-point value at the effective address of the
instruction. The effective address of the instruction is found by adding the contents of RA to the
contents of RB, unless RA is 0 (r0), in which case the effective address is just the contents of
RB.
sth
store integer halfword

sth RS, D(RA)

MEM[((RA) + (¹⁶D[16] || D[16:31])), 2] ← (RS)[16:31]

The sth instruction stores the least-significant halfword contained in RS at the effective
address of the instruction. The effective address of the instruction is found by adding the
contents of RA to the sign extended 16-bit immediate field, unless RA is 0 (r0), in which case the
effective address is the sign extended immediate field.


sthbrx
store integer halfword with bytes reversed

sthbrx RS, RA, RB

MEM[((RA) + (RB)), 2] ← ((RS)[24:31] || (RS)[16:23])

The sthbrx instruction stores the least-significant halfword contained in RS at the effective
address of the instruction. The two bytes in the halfword are swapped before being stored. The
effective address of the instruction is found by adding the contents of RA to the contents of
RB, unless RA is 0 (r0), in which case the effective address is RB.


sthu
store halfword immediate with update

sthu RS, D(RA)

MEM[((RA) + (¹⁶D[16] || D[16:31])), 2] ← (RS)[16:31]

RA ← (RA) + (¹⁶D[16] || D[16:31])

The sthu instruction stores the least-significant halfword contained in RS at the effective
address of the instruction. The effective address of the instruction is found by adding the
contents of RA to the sign extended 16-bit immediate field. If RA=0 then the instruction form is
invalid.


In addition, the effective address of the instruction is loaded into RA. 
sthux
store halfword with update

sthux RS, RA, RB

MEM[((RA) + (RB)), 2] ← (RS)[16:31]

RA ← (RA) + (RB)

The sthux instruction stores the least-significant halfword contained in RS at the effective
address of the instruction. The effective address of the instruction is found by adding the
contents of RA to the contents of RB. If RA=0 then the instruction form is invalid.


In addition, the effective address of the instruction is loaded into RA. 


sthx
store halfword

sthx RS, RA, RB

MEM[((RA) + (RB)), 2] ← (RS)[16:31]

The sthx instruction stores the least-significant halfword contained in RS at the effective
address of the instruction. The effective address of the instruction is found by adding the
contents of RA to the contents of RB, unless RA is 0 (r0), in which case the effective address is
RB.


stmw
store multiple integer word

stmw RS, D(RA)

MEM[((RA) + (¹⁶D[16] || D[16:31])), 4 × (32 - RS)] ← (RS), ..., (R31)

The contents of registers RS through R31 are stored to consecutive words in memory starting
at the effective address of the instruction. The effective address of the instruction is found by
adding the contents of RA to the sign extended 16-bit immediate field, unless RA is 0 (r0), in
which case the effective address is the sign extended immediate field. The effective address must
be a multiple of 4 (word aligned).

On 64-bit implementations, only the low-order 32 bits of each register are stored.
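
A short Python model may make the register-to-memory mapping of stmw concrete. This is a sketch that assumes big-endian byte order, takes the already computed effective address as a parameter, and represents memory as a bytearray; the function name mirrors the mnemonic.

```python
def stmw(mem, gpr, rs, ea):
    """Store the low-order words of registers rs..31 to consecutive words at ea."""
    if ea % 4 != 0:
        raise ValueError("effective address must be word aligned")
    for i, reg in enumerate(range(rs, 32)):
        word = gpr[reg] & 0xFFFFFFFF  # only the low-order 32 bits are stored
        mem[ea + 4 * i: ea + 4 * i + 4] = word.to_bytes(4, "big")
```

With RS = 29, for example, three words are written: r29 at the effective address, r30 four bytes later, and r31 four bytes after that.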


stswi
store string immediate

stswi RS, RA, n

MEM[(RA), n] ← (RS), (RS+1), ...

Starting with RS, registers are stored (only the low-order 4 bytes of 64-bit registers) to n
consecutive bytes in memory starting at the effective address of the instruction. The effective
address of the instruction is the contents of RA, unless RA is 0 (r0), in which case the effective
address is 0. Data is taken from as many consecutive registers as necessary to fill the number of
bytes being stored to (r0 follows r31).
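
The way a string store walks through the registers can be sketched in Python. The model below assumes that bytes are taken from each register starting with the most significant of its low-order 4 bytes, and that memory is a bytearray; the function name mirrors the mnemonic and the parameters are those of this illustration.

```python
def stswi(mem, gpr, rs, ea, n):
    """Store n bytes to memory at ea, taken from registers rs, rs+1, ... (r31 wraps to r0)."""
    for i in range(n):
        reg = (rs + i // 4) % 32              # four bytes per register, wrapping after r31
        word = gpr[reg] & 0xFFFFFFFF          # only the low-order 4 bytes are used
        mem[ea + i] = (word >> (24 - 8 * (i % 4))) & 0xFF
```

Storing 6 bytes starting from r31 therefore takes four bytes from r31 and the remaining two from the high-order end of r0.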


stswx
store string

stswx RS, RA, RB

MEM[((RA) + (RB)), XER[25:31]] ← (RS), (RS+1), ...

Starting with RS, registers are stored (only the low-order 4 bytes of 64-bit registers) to a
number of consecutive bytes in memory starting at the effective address of the instruction. The
number of bytes is specified in bits 25 through 31 of the XER. The effective address of the
instruction is the sum of the contents of RA and the contents of RB, unless RA is 0 (r0), in which
case the effective address is the contents of RB. Data is taken from as many consecutive registers
as necessary to fill the number of bytes being stored to (r0 follows r31).


stw
store integer word

stw RS, D(RA)

MEM[((RA) + (¹⁶D[16] || D[16:31])), 4] ← (RS)

The stw instruction stores the word contained in RS at the effective address of the instruction.
The effective address of the instruction is found by adding the contents of RA to the sign
extended 16-bit immediate field, unless RA is 0 (r0), in which case the effective address is the
sign extended immediate field.


stwbrx
store integer word with bytes reversed

stwbrx RS, RA, RB

MEM[((RA) + (RB)), 4] ← ((RS)[24:31] || (RS)[16:23] || (RS)[8:15] || (RS)[0:7])

The stwbrx instruction stores the word contained in RS at the effective address of the
instruction. The order of the four bytes in the word is reversed before the word is stored. The
effective address of the instruction is found by adding the contents of RA to the contents of RB,
unless RA is 0 (r0), in which case the effective address is RB.
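
The effect of the byte-reversed stores (this instruction and sthbrx) is easiest to see by comparing the bytes that reach memory with those of an ordinary store. The Python sketch below assumes the processor is running in big-endian mode, so the ordinary stores write the most significant byte first; the function names are invented for the illustration.

```python
def stw_bytes(rs):
    """Bytes written by an ordinary word store (most significant byte first)."""
    return (rs & 0xFFFFFFFF).to_bytes(4, "big")


def stwbrx_bytes(rs):
    """Bytes written by a byte-reversed word store."""
    return (rs & 0xFFFFFFFF).to_bytes(4, "little")


def sthbrx_bytes(rs):
    """Bytes written by a byte-reversed halfword store."""
    return (rs & 0xFFFF).to_bytes(2, "little")
```

These instructions are typically useful when exchanging data with devices or file formats that keep the least significant byte at the lowest address.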


stwu
store word immediate with update

stwu RS, D(RA)

MEM[((RA) + (¹⁶D[16] || D[16:31])), 4] ← (RS)

RA ← (RA) + (¹⁶D[16] || D[16:31])

The stwu instruction stores the word contained in RS at the effective address of the
instruction. The effective address of the instruction is found by adding the contents of RA to the
sign extended 16-bit immediate field. If RA=0 then the instruction form is invalid.


In addition, the effective address of the instruction is loaded into RA. 


stwux
store word with update

stwux RS, RA, RB

MEM[((RA) + (RB)), 4] ← (RS)

RA ← (RA) + (RB)

The stwux instruction stores the word contained in RS at the effective address of the
instruction. The effective address of the instruction is found by adding the contents of RA to the
contents of RB. If RA=0 then the instruction form is invalid.


In addition, the effective address of the instruction is loaded into RA. 


stwx 
store word 


| 31  | RS   | RA    | RB    | 151   | /  |
| 0-5 | 6-10 | 11-15 | 16-20 | 21-30 | 31 |


stwx RS, RA, RB 


MEM[ ((RA) + (RB)), 4 ] ← (RS) 


The stwx instruction stores the word contained in RS at the effective address of the instruction. 
The effective address of the instruction is found by adding the contents of RA to the 
contents of RB, unless RA is 0 (r0), in which case the effective address is the contents of RB. 


stwcx. 
conditional store word 


| 31  | RS   | RA    | RB    | 150   | 1  |
| 0-5 | 6-10 | 11-15 | 16-20 | 21-30 | 31 |


stwcx. RS, RA, RB 


MEM[ ((RA) + (RB)), 4 ] ← (RS) 


The stwcx. instruction stores the word contained in register RS at the effective address of the 
instruction if and only if a reservation exists in the processor and the address of the reservation 
corresponds to the address of this instruction. The effective address of the instruction is found 
by adding the contents of RA to the contents of RB, unless RA is 0 (r0), in which case the 
effective address is just the contents of RB. If the store is successful (that is, the correct 
reservation exists), then the reservation is cleared and the EQ bit of CR0 is set to 1. If the store is 
unsuccessful, the EQ bit of CR0 is set to 0. The effective address of this instruction must 
be a multiple of 4 (word aligned). 
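

The reservation rule can be illustrated with a toy Python model. This is a sketch under stated assumptions, not a description of any real processor: it models a single processor, it assumes the reservation is created by the lwarx instruction (described with the load instructions), and the names ReservationModel and atomic_increment are invented for the example. 

```python
class ReservationModel:
    """Toy single-processor model of the reservation checked by stwcx."""

    def __init__(self):
        self.memory = {}          # word-aligned address -> 32-bit value
        self.reservation = None   # reserved address, or None if none exists
        self.cr0_eq = 0           # EQ bit of CR0 after the last stwcx.

    def lwarx(self, addr):
        # Load the word and record a reservation for its address.
        self.reservation = addr
        return self.memory.get(addr, 0)

    def stwcx(self, rs, addr):
        if addr % 4 != 0:
            raise ValueError("effective address must be word aligned")
        if self.reservation == addr:
            self.memory[addr] = rs & 0xFFFFFFFF
            self.reservation = None   # a successful store clears the reservation
            self.cr0_eq = 1
        else:
            self.cr0_eq = 0           # the store is not performed
        return self.cr0_eq


def atomic_increment(cpu, addr):
    """Retry the load/store pair until the conditional store succeeds."""
    while True:
        value = cpu.lwarx(addr)
        if cpu.stwcx(value + 1, addr):
            return value + 1
```

The retry loop mirrors the way a program would test the EQ bit of CR0 and branch back to the load when the conditional store fails. 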


sub[o][.] 
subtract 


| 31  | RT   | RB    | RA    | OE | 40    | Rc |
| 0-5 | 6-10 | 11-15 | 16-20 | 21 | 22-30 | 31 |


sub[o][.] RT, RA, RB 


RT ← (RA) - (RB) 


The sub[o][.] instruction subtracts the contents of RB from the contents of RA and places the 
result in RT. This is an extended mnemonic for the subtract from (subf[o][.]) instruction: 


subf[o][.] RT, RB, RA 


If the subo[.] form of the instruction is used, then the overflow and summary overflow bits of 
the XER are set if overflow occurs. 


If the sub[o]. form of the instruction is used, then CR0 is updated. 


subc[o][.] 
subtract carrying 


| 31  | RT   | RB    | RA    | OE | 8     | Rc |
| 0-5 | 6-10 | 11-15 | 16-20 | 21 | 22-30 | 31 |


subc[o][.] RT, RA, RB 


RT ← (RA) - (RB) 


The subc[o][.] instruction subtracts the contents of RB from the contents of RA and places the 
result in RT. This instruction sets the XER[CA] bit. This is an extended mnemonic for the 
subtract from carrying (subfc[o][.]) instruction: 


subfc[o][.] RT, RB, RA 


If the subco[.] form of the instruction is used, then the overflow and summary overflow bits of 
the XER are set if overflow occurs. 


If the subc[o]. form of the instruction is used, then CR0 is updated. 


subf[o][.] 
integer subtract from 


| 31  | RT   | RA    | RB    | OE | 40    | Rc |
| 0-5 | 6-10 | 11-15 | 16-20 | 21 | 22-30 | 31 |


subf[o][.] RT, RA, RB 


RT ← (RB) - (RA) 


The subf[o][.] instruction subtracts the contents of RA from the contents of RB and places the 
result in RT. 


If the subfo[.] form of the instruction is used, then the overflow and summary overflow bits of 
the XER are set if overflow occurs. 


If the subf[o]. form of the instruction is used, then CR0 is updated. 


subfc[o][.] 
integer subtract from carrying 


| 31  | RT   | RA    | RB    | OE | 8     | Rc |
| 0-5 | 6-10 | 11-15 | 16-20 | 21 | 22-30 | 31 |


subfc[o][.] RT, RA, RB 


RT ← (RB) - (RA) 


The subfc[o][.] instruction subtracts the contents of RA from the contents of RB and places 
the result in RT. This instruction sets the XER[CA] bit. 


If the subfco[.] form of the instruction is used, then the overflow and summary overflow bits 
of the XER are set if overflow occurs. 


If the subfc[o]. form of the instruction is used, then CR0 is updated. 


subfe[o][.] 
integer subtract from extended 


subfe[o][.] RT, RA, RB 


RT ← (RB) + ¬(RA) + XER[CA] 


The subfe[o][.] instruction subtracts the contents of RA from the contents of RB and uses the 
carry bit from the XER (XER[CA]) as an inverted borrow into the subtraction: XER[CA]=1 means 
no borrow, and XER[CA]=0 means borrow. The result is placed into register RT. This instruction 
sets the XER[CA] bit. 


If the subfeo[.] form of the instruction is used, then the overflow and summary overflow bits 
of the XER are set if overflow occurs. 


If the subfe[o]. form of the instruction is used, then CR0 is updated. 
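

The carry convention is easiest to see in a multiword subtraction. The following Python sketch models subfc and subfe on 32-bit values and chains them to subtract two 64-bit numbers held as high and low words. The function names and the (result, carry) return convention are assumptions made for the example; the arithmetic assumes a subtraction is carried out as RB plus the complement of RA plus a carry-in, which is what makes XER[CA]=1 mean "no borrow." 

```python
MASK32 = 0xFFFFFFFF


def subfc(ra, rb):
    """Model of subfc RT, RA, RB: returns (RT, CA) with RT = RB - RA."""
    total = (rb & MASK32) + (~ra & MASK32) + 1
    return total & MASK32, total >> 32


def subfe(ra, rb, ca):
    """Model of subfe RT, RA, RB: returns (RT, CA) with RT = RB + ~RA + CA."""
    total = (rb & MASK32) + (~ra & MASK32) + ca
    return total & MASK32, total >> 32


def sub64(a_hi, a_lo, b_hi, b_lo):
    """Compute the 64-bit difference a - b from 32-bit halves."""
    lo, ca = subfc(b_lo, a_lo)       # low words first; CA=1 means no borrow
    hi, ca = subfe(b_hi, a_hi, ca)   # high words consume the carry
    return hi, lo
```

Subtracting 1 from 0x0000000100000000 borrows from the high word: the low-word subtraction leaves the carry at 0, and the high-word subtraction then yields 0. 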


subfic 
integer subtract from (immediate addressing, carrying) 


| 08  | RT   | RA    | si    |
| 0-5 | 6-10 | 11-15 | 16-31 |


subfic RT, RA, si 


RT ← (¹⁶si₁₆ || si₁₆₋₃₁) - (RA) 


The subfic instruction subtracts the contents of RA from the sign-extended 16-bit immediate 
field, placing the result into RT. The XER[CA] bit is updated with the carry out of the subtraction. 


subfme[o][.] 
subtract from minus one (extended) 


| 31  | RT   | RA    | ///   | OE | 232   | Rc |
| 0-5 | 6-10 | 11-15 | 16-20 | 21 | 22-30 | 31 |


subfme[o][.] RT, RA 


RT ← ¬(RA) + XER[CA] - 1 


The subfme[o][.] instruction subtracts the contents of RA from -1 and uses the carry bit from 
the XER (XER[CA]) as an inverted borrow into the subtraction: XER[CA]=1 means no borrow, and 
XER[CA]=0 means borrow. The result is placed into register RT. This instruction sets the XER[CA] bit. 


If the subfmeo[.] form of the instruction is used, then the overflow and summary overflow bits 
of the XER are set if overflow occurs. 


If the subfme[o]. form of the instruction is used, then CR0 is updated. 


subfze[o][.] 
subtract from zero (extended) 


| 31  | RT   | RA    | ///   | OE | 200   | Rc |
| 0-5 | 6-10 | 11-15 | 16-20 | 21 | 22-30 | 31 |


subfze[o][.] RT, RA 


RT ← ¬(RA) + XER[CA] 


The subfze[o][.] instruction subtracts the contents of RA from 0. The XER[CA] bit is a carry 
(borrow) into the subtraction. The result is placed into register RT. This instruction sets the 
XER[CA] bit. 


If the subfzeo[.] form of the instruction is used, then the overflow and summary overflow bits 
of the XER are set if overflow occurs. 


If the subfze[o]. form of the instruction is used, then CR0 is updated. 


subi 
subtract immediate 


| 14  | RT   | RA    | -value |
| 0-5 | 6-10 | 11-15 | 16-31  |


subi RT, RA, value 


RT ← (RA) - value 


The subi instruction subtracts the sign-extended immediate field from the contents of register 
RA. This is an extended mnemonic for the add immediate (addi) instruction: 


addi RT, RA, -value 


subis 
subtract immediate shifted 


| 15  | RT   | RA    | -value |
| 0-5 | 6-10 | 11-15 | 16-31  |


subis RT, RA, value 


RT ← (RA) - (value || ¹⁶0) 


The subis instruction subtracts the immediate field shifted left by 16 bits from the contents of 
register RA. This is an extended mnemonic for the add immediate shifted (addis) instruction: 


addis RT, RA, -value 


subic[.] 
subtract immediate carrying 


subic[.] RT, RA, value 


RT ← (RA) - value 


The subic[.] instruction subtracts the sign-extended immediate field from the contents of 
register RA. The carry (borrow) out of the subtraction updates the XER[CA] bit. This is an 
extended mnemonic for the add immediate carrying (addic[.]) instruction: 


addic[.] RT, RA, -value 


If the record form (subic.) is used, then CR0 is updated. 


sync 
synchronize 


| 31  | ///  | ///   | ///   | 598   | /  |
| 0-5 | 6-10 | 11-15 | 16-20 | 21-30 | 31 |


sync 


The sync instruction forms a fence between operations. Operations that occur before the sync 
instruction are completed with respect to all other operations before operations following the 
sync are initiated and storage accesses are performed with respect to all other processors and 
mechanisms. 


For weaker ordering, see the eieio instruction. 


td 
trap doubleword 


| 31  | TO   | RA    | RB    | 68    | /  |
| 0-5 | 6-10 | 11-15 | 16-20 | 21-30 | 31 |


td TO, RA, RB 


The td instruction compares the contents of register RA to the contents of RB. Five conditions 
are checked: 


■ signed less than 
■ signed greater than 
■ equal to 
■ unsigned less than 
■ unsigned greater than 


These conditions are masked by the TO field in the instruction (see Table B.7), and if any true 
condition corresponds to a one bit in the TO field, then the system trap handler is invoked. 


This instruction is defined only for 64-bit implementations. 


There is a set of extended mnemonics associated with the trap instructions (see Tables B.7 
and B.8). 


tdi 
trap doubleword immediate 


| 02  | TO   | RA    | si    |
| 0-5 | 6-10 | 11-15 | 16-31 |


tdi TO, RA, si 


The tdi instruction compares the contents of register RA to the sign-extended immediate field. 
Five conditions are checked: 


■ signed less than 
■ signed greater than 
■ equal to 
■ unsigned less than 
■ unsigned greater than 


These conditions are masked by the TO field in the instruction (see Table B.7), and if any true 
condition corresponds to a one bit in the TO field, then the system trap handler is invoked. 


This instruction is defined only for 64-bit implementations. 


There is a set of extended mnemonics associated with the trap instructions (see Tables B.7 
and B.8). 


Table B.7. TO field encoding for extended mnemonics. 


Code      Definition                            Decimal TO 
lt        less than                             16 
le        less than or equal to                 20 
eq        equal to                              4 
ge        greater than or equal to              12 
gt        greater than                          8 
nl        not less than                         12 
ne        not equal to                          24 
ng        not greater than                      20 
llt       logically less than                   2 
lle       logically less than or equal to       6 
lge       logically greater than or equal to    5 
lgt       logically greater than                1 
lnl       logically not less than               5 
lng       logically not greater than            6 
<none>    unconditional                         31 


Table B.8. Extended mnemonics for trap instructions. 


                                              32-Bit Trap Instructions   64-Bit Trap Instructions 
Instruction Semantics                         tw         twi             td         tdi 
trap unconditionally                          trap 
trap if less than                             twlt       twlti           tdlt       tdlti 
trap if less than or equal to                 twle       twlei           tdle       tdlei 
trap if equal                                 tweq       tweqi           tdeq       tdeqi 
trap if greater than or equal to              twge       twgei           tdge       tdgei 
trap if greater than                          twgt       twgti           tdgt       tdgti 
trap if not less than                         twnl       twnli           tdnl       tdnli 
trap if not equal to                          twne       twnei           tdne       tdnei 
trap if not greater than                      twng       twngi           tdng       tdngi 
trap if logically less than                   twllt      twllti          tdllt      tdllti 
trap if logically less than or equal to       twlle      twllei          tdlle      tdllei 
trap if logically greater than or equal to    twlge      twlgei          tdlge      tdlgei 
trap if logically greater than                twlgt      twlgti          tdlgt      tdlgti 
trap if logically not less than               twlnl      twlnli          tdlnl      tdlnli 
trap if logically not greater than            twlng      twlngi          tdlng      tdlngi 


tlbia 
TLB invalidate all 


| 31  | ///  | ///   | ///   | 370   | /  |
| 0-5 | 6-10 | 11-15 | 16-20 | 21-30 | 31 |


tlbia 


The tlbia instruction invalidates all TLB entries in all processors (see Appendix C). 


This instruction is privileged and is optional. 


tlbie 

TLB invalidate entry 
tlbie RA 


The tlbie instruction invalidates a TLB entry in all processors (see Appendix C). The effective 
address of the instruction is the contents of RA, and if there is a TLB entry associated with that 
address in the processor (or any other processor), then that entry is invalidated. The invalidation 
is done without reference to the segment registers or SLB. All matching entries are invalidated. 


This instruction is optional and is privileged. 


tlbsync 

TLB synchronize 
| 31  | ///  | ///   | ///   | 566   | /  |
| 0-5 | 6-10 | 11-15 | 16-20 | 21-30 | 31 |


tlbsync 


The tlbsync instruction forces all previous tlbie and tlbia instructions to complete on all 
processors before instructions following the tlbsync instruction can initiate (see Appendix C). 


This instruction is optional and is privileged. 


tw 
trap word 


| 31  | TO   | RA    | RB    | 4     | /  |
| 0-5 | 6-10 | 11-15 | 16-20 | 21-30 | 31 |


tw TO, RA, RB 


The tw instruction compares the contents of register RA to the contents of RB. Five conditions 
are checked: 


■ signed less than 
■ signed greater than 
■ equal to 
■ unsigned less than 
■ unsigned greater than 


These conditions are masked by the TO field in the instruction (see Table B.7), and if any true 
condition corresponds to a one bit in the TO field, then the system trap handler is invoked. 


There is a set of extended mnemonics associated with the trap instructions (see Tables B.7 
and B.8). 
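

The masking of the five conditions by the TO field can be sketched as follows. This Python model is an illustration only: the function names are invented, and the bit weights are taken from the Decimal TO column of Table B.7 (16 for signed less than, 8 for signed greater than, 4 for equal, 2 for unsigned less than), with the remaining weight of 1 assumed to stand for unsigned greater than. 

```python
def to_signed32(x):
    """Interpret the low 32 bits of x as a signed two's complement value."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def tw_traps(to, ra, rb):
    """Return True if a modeled tw TO, RA, RB would invoke the trap handler."""
    a_signed, b_signed = to_signed32(ra), to_signed32(rb)
    a_unsigned, b_unsigned = ra & 0xFFFFFFFF, rb & 0xFFFFFFFF
    conditions = 0
    if a_signed < b_signed:
        conditions |= 16     # signed less than
    if a_signed > b_signed:
        conditions |= 8      # signed greater than
    if a_unsigned == b_unsigned:
        conditions |= 4      # equal to
    if a_unsigned < b_unsigned:
        conditions |= 2      # unsigned less than
    if a_unsigned > b_unsigned:
        conditions |= 1      # unsigned greater than (assumed weight)
    # The trap is taken if any true condition has its TO bit set.
    return (to & conditions) != 0
```

With RA holding 0xFFFFFFFF and RB holding 0, a TO value of 16 traps (the signed comparison sees -1 < 0), while a TO value of 2 does not (the unsigned comparison sees the largest 32-bit value). 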

twi 

trap word immediate 


| 03  | TO   | RA    | si    |
| 0-5 | 6-10 | 11-15 | 16-31 |


twi TO, RA, si 


The twi instruction compares the contents of register RA to the sign-extended immediate field. 
Five conditions are checked: 


■ signed less than 
■ signed greater than 
■ equal to 
■ unsigned less than 
■ unsigned greater than 


These conditions are masked by the TO field in the instruction (see Table B.7), and if any true 
condition corresponds to a one bit in the TO field, then the system trap handler is invoked. 


There is a set of extended mnemonics associated with the trap instructions (see Tables B.7 
and B.8). 


xor[.] 


logical XOR 


| 31  | RA   | RT    | RB    | 316   | Rc |
| 0-5 | 6-10 | 11-15 | 16-20 | 21-30 | 31 |


xor[.] RT, RA, RB 


RT ← (RA) ⊕ (RB) 


The xor[.] instruction performs a logical XOR of the contents of register RA with the contents 
of register RB. The result is then stored into register RT. 


If the Rc bit is set (xor.), then CR0 is updated. 


xori 

logical XOR immediate 
| 26  | RA   | RT    | ui    |
| 0-5 | 6-10 | 11-15 | 16-31 |


xori RT, RA, ui 


RT ← (RA) ⊕ (¹⁶0 || ui) 


The xori instruction performs a logical XOR of the low-order 16 bits of the contents of register 
RA with the immediate field ui; the high-order 16 bits of RA pass through unchanged. The result 
is then stored into register RT. 


xoris 


logical XOR shifted immediate 


| 27  | RA   | RT    | ui    |
| 0-5 | 6-10 | 11-15 | 16-31 |


xoris RT, RA, ui 


RT ← (RA) ⊕ (ui || ¹⁶0) 


The xoris instruction performs a logical XOR of the high-order 16 bits of the contents of 
register RA with the immediate field ui; the low-order 16 bits of RA pass through unchanged. 
The result is then stored into register RT. 
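

Because each instruction touches only one half of the word, a full 32-bit constant can be XORed into a register with one xoris followed by one xori. The Python sketch below models the two operations; the function names, including the helper xor_const32, are invented for the illustration. 

```python
def xori(ra, ui):
    """Model of xori: RT = RA xor (16 zero bits || ui)."""
    return (ra ^ (ui & 0xFFFF)) & 0xFFFFFFFF


def xoris(ra, ui):
    """Model of xoris: RT = RA xor (ui || 16 zero bits)."""
    return (ra ^ ((ui & 0xFFFF) << 16)) & 0xFFFFFFFF


def xor_const32(ra, const):
    """XOR a 32-bit constant into RA: high half with xoris, low half with xori."""
    return xori(xoris(ra, const >> 16), const & 0xFFFF)
```

The order of the two steps does not matter, because each one leaves the other half of the word untouched. 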


Appendix C 
Operating System Design for PowerPC Processors 


This appendix covers parts of the PowerPC processor and system-management architecture 
that will be most interesting to OS and systems programmers, such as the instructions and 
architectural features designed for implementing security, virtual memory and protection, 
interrupts, and interprocess communication and synchronization. 


Machine State Register 


The machine state register (MSR) is the main processor mode register; bits in this register 
select supervisor mode, enable and disable some interrupts, and determine how the processor 
reacts to some system events. The MSR defines the state of the processor. The MSR is 32 bits 
wide in 32-bit PowerPC implementations (like the 601, 603, and 604 processors), and 64 bits 
in 64-bit implementations (like the PowerPC 620 processor). At this time, the only difference 
between the 32-bit and 64-bit MSR is a bit in the upper-order 32 bits that selects 64-bit mode. 
Therefore, a discussion of the 32-bit MSR, which is identical to the low-order 32 bits of the 
64-bit MSR, is sufficient. Figure C.1 shows the format of the MSR. The following list describes 
the bits in the MSR: 


■ Power Management Enable (POW) 
If this bit is set to 1 and the processor has hardware to perform dynamic power 
management, then the dynamic power management is enabled. The actual power-management 
hardware is implementation dependent (see the User’s Manual for the 
particular implementation you are using). 

■ Interrupt Little Endian (ILE) 
When an interrupt occurs, this bit is copied into the LE bit (this is described later in 
this section) to select the endian mode of the interrupt handler. 

■ External Interrupt Enable (EE) 
If this bit is set to 1, then external and decrementer interrupts are allowed to occur 
(see “PowerPC Interrupt Architecture,” later in this appendix). 

■ Problem State (PR) 
This bit determines whether the processor is allowed to execute privileged instructions. 

■ Floating-Point Available (FP) 
If this bit is set to 1, the processor is allowed to execute floating-point instructions. 
Otherwise, a program interrupt occurs if the processor attempts to execute a floating-point 
instruction. The operating system may use this bit to determine whether 
floating-point registers need to be saved during a state save operation. 

■ Machine Check Enable (ME) 
If this bit is set to 1, machine check interrupts are allowed to occur. 

■ Floating-Point Exception Modes 0 and 1 (FE0, FE1) 
These bits determine how floating-point interrupts occur. They are described in more 
detail later in this appendix, with interrupts and with the floating-point execution 
model. 

■ Single-Step Trace Enable (SE) 
This bit enables the single-step trace mode (which is implementation dependent). For 
details about this function, see the User’s Manual for the particular processor you are 
using. 

■ Branch Trace Enable (BE) 
This bit enables the branch trace mode (which is implementation dependent). For 
details about this function, see the User’s Manual for a specific PowerPC processor. 

■ Interrupt Prefix (IP) 
This bit determines the high-order bits of the interrupt address that are generated 
when an interrupt occurs (see “PowerPC Interrupt Architecture,” later in this appendix). 

■ Instruction Relocate (IR) 
If this bit is set to 1, the processor performs address translation on all instruction fetch 
accesses (see “Virtual Memory,” later in this appendix). If the bit is 0, then the 
untranslated effective address is used as the physical address. 

■ Data Relocate (DR) 
If this bit is set to 1, then the processor performs address translation on all data 
accesses (see “Virtual Memory,” later in this appendix). If the bit is 0, then the 
effective address is used as the physical address. 

■ Recoverable Interrupt (RI) 
This bit determines whether certain interrupts are recoverable. 

■ Little Endian Mode (LE) 
This bit determines which endian mode is used in the processor for all memory 
accesses (see Appendix D, “A Detailed Floating-Point Model”). 


FIGURE C.1. The machine state register. 


Bits     Field 
0-12     Reserved 
13       POW 
14       Implementation dependent 
15       ILE 
16       EE 
17       PR 
18       FP 
19       ME 
20       FE0 
21       SE 
22       BE 
23       FE1 
24       Reserved 
25       IP 
26       IR 
27       DR 
28-29    Reserved 
30       RI 
31       LE 
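

When you test or set MSR bits in software, remember that PowerPC numbers bits from the most significant end, so bit 31 is the least significant bit of the 32-bit MSR. The Python sketch below turns a bit name into a mask. The bit positions are an assumption based on the MSR layout in Figure C.1 (check the User's Manual for your processor), and the names MSR_BITS, msr_mask, and msr_test are invented for the example. 

```python
# Assumed bit numbers in the 32-bit MSR; bit 0 is the most significant bit.
MSR_BITS = {
    "POW": 13, "ILE": 15, "EE": 16, "PR": 17, "FP": 18, "ME": 19,
    "FE0": 20, "SE": 21, "BE": 22, "FE1": 23, "IP": 25, "IR": 26,
    "DR": 27, "RI": 30, "LE": 31,
}


def msr_mask(name):
    """Return the mask for a named MSR bit."""
    return 1 << (31 - MSR_BITS[name])


def msr_test(msr, name):
    """Return True if the named bit is 1 in the given MSR value."""
    return (msr & msr_mask(name)) != 0
```

Under these assumptions, the PR bit corresponds to the mask 0x4000 and the EE bit to 0x8000. 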


Problem State 


Two problem states are defined in the PowerPC architecture. When the MSR[PR] bit is 1, certain 
instructions cannot be executed; these are called privileged instructions, and the mode is called 
non-privileged mode. When the MSR[PR] bit is 0, all instructions can be executed. This mode is 
called privileged or kernel mode. Generally, an operating system sets this bit to 1 before 
returning control to a user program. This setting enables the operating system to protect 
critical resources (such as the MSR, or the virtual memory resources). When an interrupt occurs, 
control is passed back to the operating system, and this bit is set to 0 in order to enable the 
interrupt handler to run in privileged mode. Another important feature of these problem states 
is that memory protection can be assigned differently for each of the states (see “Virtual 
Memory,” later in this appendix). 


Special-Purpose Registers 


In Chapters 3 and 4, you learned about the move to special-purpose register (mtspr) and the move 
from special-purpose register (mfspr) instructions, and how you can use them to move data 
between the general integer registers and the link register (LR), the count register (CTR), and 
the integer exception register (XER). These instructions can be used to manipulate other 
special-purpose registers; there are 34 that may be written and 33 that can be read. The 
general form of these instructions follows: 


mtspr SPR, RS 
mfspr RT, SPR 


The SPR field is a 10-bit field that identifies the special-purpose register. The RS and RT fields 
identify the general integer register from which data is read or into which data is stored, 
respectively. Table C.1 shows a complete list of the special-purpose registers. 


Table C.1. Special-purpose registers. 


                                                                  Encoding 
Special-Purpose Register                                          Decimal   Hexadecimal  Privileged  Read/Write 
Integer Exception Register (XER)                                  1         0x001        no          R/W 
Link Register (LR)                                                8         0x008        no          R/W 
Count Register (CTR)                                              9         0x009        no          R/W 
Data Storage Interrupt Status Register (DSISR)                    18        0x012        yes         R/W 
Data Address Register (DAR)                                       19        0x013        yes         R/W 
Decrementer (DEC)                                                 22        0x016        yes         R/W 
Storage Descriptor Register 1 (SDR1)                              25        0x019        yes         R/W 
Machine Status Save/Restore Register 0 (SRR0)                     26        0x01A        yes         R/W 
Machine Status Save/Restore Register 1 (SRR1)                     27        0x01B        yes         R/W 
Software Use SPR 0 (SPRG0)                                        272       0x110        yes         R/W 
Software Use SPR 1 (SPRG1)                                        273       0x111        yes         R/W 
Software Use SPR 2 (SPRG2)                                        274       0x112        yes         R/W 
Software Use SPR 3 (SPRG3)                                        275       0x113        yes         R/W 
External Access Register (EAR)                                    282       0x11A        yes         R/W 
Time Base Lower (TBL)                                             284       0x11C        yes 
Time Base Upper (TBU)                                             285       0x11D        yes 
Processor Version Register (PVR)                                  287       0x11F        yes         R 
Instruction Block Address Translation Upper Register 0 (IBAT0U)   528       0x210        yes         R/W 
Instruction Block Address Translation Lower Register 0 (IBAT0L)   529       0x211        yes         R/W 
Instruction Block Address Translation Upper Register 1 (IBAT1U)   530       0x212        yes         R/W 
Instruction Block Address Translation Lower Register 1 (IBAT1L)   531       0x213        yes         R/W 
Instruction Block Address Translation Upper Register 2 (IBAT2U)   532       0x214        yes         R/W 
Instruction Block Address Translation Lower Register 2 (IBAT2L)   533       0x215        yes         R/W 
Instruction Block Address Translation Upper Register 3 (IBAT3U)   534       0x216        yes         R/W 
Instruction Block Address Translation Lower Register 3 (IBAT3L)   535       0x217        yes         R/W 
Data Block Address Translation Upper Register 0 (DBAT0U)          536       0x218        yes         R/W 
Data Block Address Translation Lower Register 0 (DBAT0L)          537       0x219        yes         R/W 
Data Block Address Translation Upper Register 1 (DBAT1U)          538       0x21A        yes         R/W 
Data Block Address Translation Lower Register 1 (DBAT1L)          539       0x21B        yes         R/W 
Data Block Address Translation Upper Register 2 (DBAT2U)          540       0x21C        yes         R/W 
Data Block Address Translation Lower Register 2 (DBAT2L)          541       0x21D        yes         R/W 
Data Block Address Translation Upper Register 3 (DBAT3U)          542       0x21E        yes         R/W 
Data Block Address Translation Lower Register 3 (DBAT3L)          543       0x21F        yes         R/W 
Data Address Breakpoint Register (DABR)                           1013      0x3F5        yes         R/W 

The XER, LR, and CTR registers were described earlier with the integer and branch instructions. 
The DSISR, DAR, SRR0, SRR1, TBL, TBU, and DABR registers are described with 
the interrupt architecture (see below). The SDR1, EAR, eight DBAT, and eight IBAT registers are 
described in the “Virtual Memory” section. This section covers the remaining four SPRG 
registers and the PVR register. 


The Software-use SPRs (SPRG0-3) are registers with no specific architectural meaning. These 
registers are provided as privileged registers reserved for operating system use, such as scratch 
space for an interrupt handler, or to hold information unique to each processor in a 
multiprocessor system. 


The Processor Version register (PVR) is a read-only register that returns a 32-bit value 
identifying the processor implementation. This 32-bit value is split into two 16-bit pieces. The 
upper-order 16 bits are a version number identifying a particular processor number (601, 603, 
and so on). The lower-order 16 bits identify a revision number for the particular processor 
(engineering change level). 
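

Splitting a PVR value into its two halves takes one shift and one mask. The Python sketch below shows the arithmetic; the function name split_pvr is invented, and the sample value used afterward is hypothetical rather than the PVR of any particular processor. 

```python
def split_pvr(pvr):
    """Split a 32-bit PVR value into (version, revision)."""
    pvr &= 0xFFFFFFFF
    version = pvr >> 16        # upper-order 16 bits
    revision = pvr & 0xFFFF    # lower-order 16 bits
    return version, revision
```

A hypothetical PVR value of 0x00030102 would split into version 3 and revision 0x0102. 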


Virtual Memory 


As described in Chapter 2, “Introduction to the PowerPC Architecture,” virtual memory is a 
method of increasing the apparent size of memory by using main memory (typically DRAM 
memory) as a cache for a larger storage space (such as disk storage). (See the cache descriptions 
in Chapter 7, “Performance Tuning and Optimization.”) You might create a virtual 1GB 
memory space on a machine with only 4MB of DRAM memory, for example, by using a 1GB 
hard drive. Whenever a piece of data is requested by a program, the DRAM memory is checked 
first, and if the data is there, it is accessed. If the data is not there, however, the data is read 
from the hard drive and placed into the DRAM memory for future use (some other piece of 
data is displaced from the DRAM memory and put back into the hard drive). The hard drive 
space is referred to as paging space. Why do all this work instead of simply using the hard drive 
as the only memory device? In a word, speed. Access time to DRAM memory is measured in 
nanoseconds (ns); access time to a hard drive is many orders of magnitude slower, and is 
measured in milliseconds (ms). The DRAM memory is used as a cache for the much larger, slower 
paging space device (usually a hard drive). Caches are described in Chapter 7. This particular 
cache is controlled by software through a number of mechanisms contained in hardware. In 
this section, the interface to those mechanisms is described. 


Virtual memory management generally is in the realm of the operating system, and in the 
PowerPC architecture, most of the instructions used to manage the virtual memory space are 
privileged, meaning that user programs cannot directly execute the instructions. The PowerPC 
architecture supports two memory-management schemes: a paging system and a block address 
translation system. The paging system is discussed first, followed by the block address 
translation (BAT) mechanism. (The BAT mechanism enables the programmer to map larger pieces 
of memory with a single translation resource.) There are differences between the 32-bit and 
the 64-bit virtual memory mechanisms, so these are covered separately in the following 
sections. 


32-Bit Page Translation Mechanism 


This section discusses virtual memory management on 32-bit implementations of the PowerPC 
architecture. The 64-bit virtual memory architecture is discussed in “64-Bit Virtual Memory 
Architecture,” later in this appendix. Recall from Chapter 2 that the PowerPC architecture has 
a 52-bit virtual address space, which is split into 2²⁴ 256MB pieces called segments. Each of 
these segments is broken down further into 2¹⁶ 4KB chunks called pages. Pages are the granule 
of memory that can be moved from the virtual storage device (a hard drive, for example) to the 
physical storage device (DRAM, for example). 


The process uses a 32-bit address that maps to a portion of the virtual address space, which is 
accessible only by that process (ignoring shared memory spaces for the moment). This address 
is called an effective address, and it is all a programmer normally needs to see. Figure C.2 shows 
a simplified view of address translation. 


The first structure associated with the translation mechanism is the segment register. The PowerPC 
architecture defines 16 segment registers; each contains information about a memory segment, 
and together they form the set of active segments in the processor. The most significant 4 bits of 
the effective address are used to select one of the 16 active segments. The segment register contains 
several pieces of information, including a virtual segment ID (VSID) that is a 24-bit value used to 
uniquely identify the segment being accessed in virtual memory. The 24-bit segment ID with 
the remainder of the effective address (the low-order 28 bits) concatenated to it forms the 
52-bit virtual address. Next, the virtual address must be translated into a physical address. This 
process involves determining where in physical memory the virtual page for this address 
currently resides. If we assign 16 VSIDs to each process, this results in 1 million processes, each 
with a 32-bit effective address space. 
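

The formation of the virtual address can be sketched in Python. This is a model of the arithmetic only: the function name virtual_address is invented, and the 16 segment registers are represented by a plain list of VSID values (the other fields of a segment register are ignored). 

```python
def virtual_address(ea, segment_vsids):
    """Form the 52-bit virtual address for a 32-bit effective address.

    segment_vsids is a list of 16 VSID values (24 bits each), one per
    segment register.
    """
    ea &= 0xFFFFFFFF
    segment_number = ea >> 28          # most significant 4 bits of the address
    low_28_bits = ea & 0x0FFFFFFF      # remainder of the effective address
    vsid = segment_vsids[segment_number] & 0xFFFFFF
    # Concatenate the 24-bit VSID with the low-order 28 bits.
    return (vsid << 28) | low_28_bits
```

If segment register 10 holds the VSID 0x123456, the effective address 0xA0001234 maps to the virtual address 0x1234560001234. 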


The next structure associated with address translation is the page table. The page table contains 
the mapping of virtual pages to physical pages. When a page is read from the virtual memory 
storage device and placed into the physical memory storage device, the page table is updated to 
reflect the new mapping (it lists the virtual page currently contained in each physical page). 
The page table is searched using a hash function (described in “Page Table Construction,” later 
in this appendix). The virtual segment ID is used with the page index (bits 4 through 19 of the 
effective address) to access the page table, resulting in a 20-bit real page number. The real page 
number has the page offset (bits 20 through 31 of the effective address) concatenated to it to 
form the physical address. The physical address then is used to access system memory. 


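FIGURE C.2. Address translation. 


Effective address: Segment Number (bits 0-3) | Page Index (bits 4-19) | Page Offset (bits 20-31) 
Through the segment registers: Virtual Segment ID (bits 0-23) | Page Index (bits 24-39) | Page Offset (bits 40-51) 
Through the page table: Real Page Number (bits 0-19) | Page Offset (bits 20-31) 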
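

The complete path from effective address to physical address can be sketched the same way. In this Python model a dictionary keyed by (VSID, page index) stands in for the page table, so the hash search is not modeled at all; the function names and the dictionary layout are assumptions made for the illustration. 

```python
def split_effective_address(ea):
    """Return (segment number, page index, page offset) of a 32-bit address."""
    ea &= 0xFFFFFFFF
    segment_number = ea >> 28            # bits 0-3
    page_index = (ea >> 12) & 0xFFFF     # bits 4-19
    page_offset = ea & 0xFFF             # bits 20-31
    return segment_number, page_index, page_offset


def physical_address(ea, segment_vsids, page_table):
    """Translate an effective address to a physical address.

    page_table maps (vsid, page_index) to a 20-bit real page number.
    A missing entry raises KeyError, standing in for a page fault.
    """
    segment_number, page_index, page_offset = split_effective_address(ea)
    vsid = segment_vsids[segment_number]
    real_page_number = page_table[(vsid, page_index)]
    # Concatenate the real page number with the 12-bit page offset.
    return ((real_page_number & 0xFFFFF) << 12) | page_offset
```

Note that the page offset passes through translation unchanged; only the upper 20 bits of the address are replaced. 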





Page Properties 


Different pages in memory can have different properties. The properties of a page are specified 
in the segment register or the page table, and generally are set by the operating system when 
the page is set up (initialized or brought in from paging space). 


Page Protection 


The first property that pages have is their page protection settings. Page protection settings can 
be different, depending on whether the processor is in privileged (operating system) mode or 
nonprivileged (user) mode. Three types of access permissions are defined by the PowerPC 
architecture: 


■ No access. Neither loads nor stores are allowed to occur. 
■ Read-only access. Loads are allowed, but stores are not allowed. 
■ Read/write access. Both loads and stores are allowed to occur. 


An operating system therefore may choose to control access to some critical system resource by 
setting the user code permission to No Access, while retaining Read/Write Access for itself. 
The user code might have to perform a system call to ask the operating system to 
read or write that system resource. Another possibility is that the operating system may give 
one process read/write access, while denying access to all other processes. This could be done 
by setting the page protection properties one way while that process is running, but changing 
them when a different process gains control of the CPU. 
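

The effect of per-mode protection settings can be modeled with a small Python sketch. It captures only the three permission types and the choice between privileged and nonprivileged settings; the way the hardware actually encodes these settings in the segment register and page table is not modeled, and all names here are invented for the example. 

```python
NO_ACCESS = "no access"
READ_ONLY = "read-only"
READ_WRITE = "read/write"


def access_allowed(permission, is_store):
    """Return True if a load (is_store=False) or store is permitted."""
    if permission == READ_WRITE:
        return True
    if permission == READ_ONLY:
        return not is_store
    return False


class PageProtection:
    """Holds one permission for privileged mode and one for user mode."""

    def __init__(self, privileged, user):
        self.privileged = privileged
        self.user = user

    def check(self, msr_pr, is_store):
        # MSR[PR]=1 selects the nonprivileged (user) setting.
        permission = self.user if msr_pr else self.privileged
        return access_allowed(permission, is_store)
```

A page created as PageProtection(READ_WRITE, NO_ACCESS) matches the first scenario above: the operating system can load and store, while user code can do neither. 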


Cache Policy 


Another property that can be defined on a per page basis is how a cache will behave for accesses 
to that page. Some programs (especially in a multiprogramming environment) rely on certain 
specific cache behavior. Instructions can be used to force some of this behavior, but this 
process can be cumbersome or impossible in some cases. The PowerPC architecture therefore defines 
a set of properties on a page granularity to force certain cache behavior. 


A page can be set to write-through mode so that accesses to that page always behave as if all 
levels of cache in the system are write-through caches. This mode means that read accesses to 
that page behave normally, but stores always propagate out to main memory at the time they 
occur. If a program requires data in main memory to match data in any caches in the system, 
the pages containing that data should be set to write-through mode. 


A page can be set to cache-inhibited mode. In this mode, data accesses to that page behave as if 
there were no caches in the system at all. In this case, when data is read or written, the access is 
performed only in main memory, and no data is placed in any cache. An example where cache- 
inhibited mode would be used is in memory-mapped I/O. For memory-mapped I/O, reading 
the same location in main memory may result in different data (because the memory location 
is mapped to some device rather than to a standard memory area). In this case, it would be bad 
to cache the data from a read, because a second read to the same address should return new 
data from the device rather than the old data (cached from the previous read). Thus, it is im- 
portant to disable all caches for these accesses. 


A problem can arise when multiple processing devices try to access the same data in main 
memory. Some devices that may cause this problem to arise are two microprocessors in a mul- 
tiprocessing system, or a microprocessor and a direct memory access (DMA) controller in a 
uniprocessor system. The problem is that one device may have data cached locally (in a cache 
that is not shared by the other device), in which case the other device will get old (stale) data 
when it performs a read access to that data. To solve this problem, the PowerPC architecture 
defines a page property of memory coherence, which tells the hardware to make sure that all the 
different devices in the system see the most recent copy of the data within that page. This is 
discussed in more detail in “Multiprogramming and Multiprocessor Synchronization,” later 
in this chapter. 


The final property that pages can have is related to caches but not unique to cache behavior. 
Often, in modern processors, data is fetched before it is known to be needed. This generally is 
done invisibly so that the programmer need not be concerned with it, but there are cases where 
it may not be invisible to the programmer. If a load is executed speculatively, for example, and 
the address loaded happens to be a memory-mapped I/O device, the results can be disastrous. 
The device very well may change its data because of the read, and the processor may discover 
that the load should not have executed and throw away the data it read. That data now is gone 
forever. As a solution, the PowerPC architecture defines a property for pages called guarded
storage. If a page is guarded, data cannot be speculatively read from it. Basically, any location
whose loads have side effects, as a store would (non-idempotent storage), should be placed in
guarded storage.
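
The four per-page properties described in this section (write-through, cache-inhibited, memory
coherence, and guarded) often are abbreviated W, I, M, and G. The following Python sketch
shows one way to think about choosing them. The numeric bit values, the page kinds, and the
helper names are assumptions made up for this illustration; they are not taken from any
processor manual.

```python
from enum import IntFlag


class WIMG(IntFlag):
    """Per-page storage properties, written here as a 4-bit W/I/M/G value.

    The numeric encoding is an assumption chosen for illustration only.
    """
    W = 0b1000  # write-through: stores always propagate to main memory
    I = 0b0100  # cache-inhibited: accesses bypass every cache
    M = 0b0010  # memory coherence: hardware keeps all devices' views consistent
    G = 0b0001  # guarded: no speculative reads from this page


def attributes_for(kind: str) -> WIMG:
    """Pick properties for a few hypothetical kinds of page."""
    table = {
        "private_data": WIMG(0),
        "write_through_data": WIMG.W,
        "shared_data": WIMG.M,
        # Device registers: never cache them, never read them speculatively.
        "memory_mapped_io": WIMG.I | WIMG.G,
    }
    return table[kind]


def may_cache(attrs: WIMG) -> bool:
    """True if data from the page may be placed in a cache."""
    return not (attrs & WIMG.I)


def may_speculate_loads(attrs: WIMG) -> bool:
    """True if the processor may load from the page before the load is known to be needed."""
    return not (attrs & WIMG.G)
```

With this sketch, attributes_for("memory_mapped_io") yields a value for which both may_cache()
and may_speculate_loads() return False, which matches the memory-mapped I/O discussion above.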


Block Translation Mechanism 


The block address translation (BAT) mechanism in the PowerPC architecture enables the pro- 
grammer to map a larger than 4-KB piece of memory using a single translation resource. The 
BAT mechanism uses special-purpose registers, which are entirely software controlled, in or- 
der to perform the translation. Each translation resource consists of two BAT registers: an upper 
register and a lower register (see “Special-Purpose Registers,” earlier in this appendix). The BAT 
mechanism is a single-level translation. The upper-order effective address bits are compared 
against some bits in the upper BAT register (the block effective page index), and if there is a 
match, the upper-order bits of the real address are read from the lower BAT register (the block 
real page number). The PowerPC architecture includes four sets of instruction BAT registers 
(each set consists of an upper and a lower register) and four sets of data BAT registers. 


Not only does the BAT mechanism enable you to map larger spaces, but it also enables you to
map different-size spaces. You therefore could map one area (a video frame buffer, for example)
into a 4-MB BAT space, and another area (the operating system kernel, for example) into a
1-MB BAT space. Each of these mappings would consume one pair of BAT registers (as opposed
to 1,024 and 256 page table entries, respectively). The upper BAT register contains an 11-bit
mask that is used to select the block size. Table C.2 shows all the possible sizes of memory
blocks that can be mapped to a BAT entry.


Table C.2. Allowable BAT sizes/BAT mask values.

BAT Size    BAT Mask

128 KB      000 0000 0000
256 KB      000 0000 0001
512 KB      000 0000 0011
1 MB        000 0000 0111
2 MB        000 0000 1111
4 MB        000 0001 1111
8 MB        000 0011 1111
16 MB       000 0111 1111
32 MB       000 1111 1111
64 MB       001 1111 1111
128 MB      011 1111 1111
256 MB      111 1111 1111

The BAT mechanism supports the same page protection and cache behavior bits as the page
mechanism.
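
The mask values in Table C.2 follow a simple pattern: each doubling of the block size adds one
more low-order 1 bit. The following Python sketch computes the mask from the block size and
counts the 4-KB pages that a block replaces. The function names are hypothetical and are used
only for illustration.

```python
KB = 1024
MB = 1024 * KB

PAGE_SIZE = 4 * KB      # size of one ordinary page
MIN_BLOCK = 128 * KB    # smallest block size in Table C.2
MAX_BLOCK = 256 * MB    # largest block size in Table C.2


def bat_mask(block_size: int) -> int:
    """Return the 11-bit BAT mask for a block size listed in Table C.2."""
    if block_size < MIN_BLOCK or block_size > MAX_BLOCK:
        raise ValueError("block size must be between 128 KB and 256 MB")
    if block_size & (block_size - 1):
        raise ValueError("block size must be a power of two")
    # 128 KB -> 0, 256 KB -> 1, 512 KB -> 3, and so on.
    return block_size // MIN_BLOCK - 1


def pages_replaced(block_size: int) -> int:
    """Number of 4-KB page table entries that one block mapping stands in for."""
    return block_size // PAGE_SIZE


def format_mask(mask: int) -> str:
    """Format a mask the way Table C.2 prints it (3 + 4 + 4 bits)."""
    bits = format(mask, "011b")
    return f"{bits[:3]} {bits[3:7]} {bits[7:]}"
```

For the frame buffer example, format_mask(bat_mask(4 * MB)) gives "000 0001 1111", and
pages_replaced(4 * MB) gives 1024.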


32-Bit Virtual Memory Management 


This section covers some of the issues associated with managing virtual memory on a 32-bit 
implementation of the PowerPC architecture. This memory management generally is handled 
by the operating system. 


Segment Registers 


The segment registers are accessed whenever the processor needs to translate an address. The 
format of the segment register is shown in Figure C.3. The fields of a segment register follow: 


T       This field selects between two formats for the remaining bits in the
        segment register. If T=0, this is a normal segment (this is the format
        described here). If T=1, this is a direct store segment (see the “Direct Store
        Segments” section later in this appendix).

Ks      This is the page protection key for operating system code (MSR[PR] = 0).
Kp      This is the page protection key for application code (MSR[PR] = 1).
N       If N=1, this segment is a no-execute segment.

VSID    This is the virtual segment ID.


FIGURE C.3.
The segment register format (normal segments).

T (bit 0) | Ks (bit 1) | Kp (bit 2) | N (bit 3) | Reserved (bits 4-7) | VSID (bits 8-31)


Two instructions can be used to update segment registers. The move to segment register instruc- 
tion (mtsr) copies the contents of a general integer register into a segment register specified in 
the instruction. The move to segment register indirect instruction (mtsrin) copies the contents of 
a general integer register into a segment register specified in another general integer register. 


Similarly, two instructions can be used to read the segment registers. The move from segment
register instruction (mfsr) copies the contents of a segment register specified in the instruction
into a general integer register. The move from segment register indirect instruction (mfsrin)
copies the contents of a segment register (specified in a general integer register) into a general
integer register.


Synchronization requirements are associated with these instructions; these are specified in the 
section “Synchronization Requirements for Segment Registers, Page Table Entries, and Seg- 
ment Table Entries,” later in this appendix. 
Page Table Construction


The page table is a software-controlled table used by the processor during address translation.
The table is organized into groups of eight entries, each of which is 64 bits long (see Figure
C.4). The hardware searches the page table each time it needs to translate an address. As
mentioned previously, the processor uses a hash function to look up page table entries.


FIGURE C.4. 
64-bit page table entry: 
32-bit architecture. 





The result of the hash function is the real (physical) address of the page table entry group that
should be searched for the translation. The most significant bits of this address come from a
special-purpose register: storage description register 1 (SDR1). The high-order 16 bits of this
register are called the hash table origin (HTABORG), and the least significant 9 bits of SDR1
are called the hash table mask (HTABMASK). Between 7 and 16 of the high-order bits of the
HTABORG field are used to identify the base address of the page table (see Figure C.5). The
minimum supported page table size is 64 KB, and the maximum supported page table size is
32 MB.


FIGURE C.5.
The organization of storage description register 1 (SDR1) and formation of the page table
entry group address.

HTABORG (bits 0-15) | Reserved (bits 16-22) | HTABMASK (bits 23-31)

* HTABORG - Real address of the page table.
* HTABMASK - Mask for the page table address.
* HASH - The 19-bit hash value.

Page table group physical address generation:

HTABORG[0-6] || (HTABORG[7-15] | (HTABMASK & HASH[0-8])) || HASH[9-18] || 000000
(where || is concatenate, | is logical OR, and & is logical AND)


The index into the page table comes from a hash value generated using the effective address of 
the access being translated. Two hash values are generated: a primary hash and a secondary 
hash. These hash values, together with the values in SDR1, give the physical addresses of two 
page table entry groups, which are searched for the appropriate translation information (see 
figure C.5). The primary hash is generated by XORing the low-order 19 bits of the VSID (from 
the segment register) with three 0s concatenated with the page index (from the effective ad- 
dress). The secondary hash is the 1’s complement (logical inversion) of the primary hash. 
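
The hash computation and the address formation can be sketched in a few lines of Python. The
sketch assumes a 16-bit page index, a 19-bit hash, and the concatenation HTABORG[0-6] ||
(HTABORG[7-15] | (HTABMASK & HASH[0-8])) || HASH[9-18] || six 0 bits, with bit 0 as the most
significant bit of each field. The function names are hypothetical.

```python
HASH_MASK = 0x7FFFF  # the hash is 19 bits wide


def primary_hash(vsid: int, page_index: int) -> int:
    """XOR the low-order 19 bits of the VSID with the zero-extended 16-bit page index."""
    return (vsid & HASH_MASK) ^ (page_index & 0xFFFF)


def secondary_hash(primary: int) -> int:
    """The secondary hash is the 1's complement of the primary hash."""
    return ~primary & HASH_MASK


def pteg_address(htaborg: int, htabmask: int, hash_value: int) -> int:
    """Form the 32-bit physical address of a page table entry group.

    htaborg is the 16-bit HTABORG field, htabmask the 9-bit HTABMASK field,
    and hash_value a 19-bit primary or secondary hash.
    """
    hash_hi = (hash_value >> 10) & 0x1FF   # HASH[0-8], the 9 high-order bits
    hash_lo = hash_value & 0x3FF           # HASH[9-18], the 10 low-order bits
    org_hi = (htaborg >> 9) & 0x7F         # HTABORG[0-6]
    org_lo = htaborg & 0x1FF               # HTABORG[7-15]
    middle = org_lo | (htabmask & hash_hi)
    # 7 bits || 9 bits || 10 bits || 6 zero bits = 32 bits
    return (org_hi << 25) | (middle << 16) | (hash_lo << 6)
```

Because the low-order 6 bits are always 0, every group address is a multiple of 64 bytes, which
is the size of eight 64-bit page table entries. With an HTABMASK of 0, only HASH[9-18]
selects the group, which gives the 64-KB minimum table size (1,024 groups of 64 bytes).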


The page table entry group (PTEG) accessed from the primary hash is called the primary PTEG, 
and the PTEG accessed from the secondary hash is called the secondary PTEG. After the ad- 
dresses of the two page table entry groups have been calculated, these two groups are searched 
for translation information. Each of these groups consists of eight page table entries (see figure C.5). 
A page table entry (PTE) for the current translation is found if the following conditions are
met:

■ PTE[H] = 0 for the primary PTEG, and 1 for the secondary PTEG.
■ PTE[V] = 1.
■ PTE[VSID] = VSID (from the segment register).
■ PTE[API] = effective address bits 4 through 9 (the first 6 bits of the page index).


After a valid PTE is found, the RPN from that PTE is concatenated with the page offset from
the effective address to form the 32-bit physical address.
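
The search and the final concatenation can be modeled as follows. This Python sketch assumes
4-KB pages (a 12-bit page offset) and represents each PTE as a small record; the PTE class and
the function names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PTE:
    v: int      # valid bit
    vsid: int   # virtual segment ID
    h: int      # hash function identifier (0 = primary, 1 = secondary)
    api: int    # abbreviated page index
    rpn: int    # real page number


def find_pte(pteg: Sequence[PTE], vsid: int, ea: int, h: int) -> Optional[PTE]:
    """Search one PTEG for an entry that satisfies all four match conditions."""
    api = (ea >> 22) & 0x3F  # effective address bits 4 through 9 (bit 0 is the MSB)
    for pte in pteg:
        if pte.v == 1 and pte.h == h and pte.vsid == vsid and pte.api == api:
            return pte
    return None


def translate(primary: Sequence[PTE], secondary: Sequence[PTE],
              vsid: int, ea: int) -> Optional[int]:
    """Return the 32-bit physical address, or None to model a page fault."""
    pte = find_pte(primary, vsid, ea, 0) or find_pte(secondary, vsid, ea, 1)
    if pte is None:
        return None
    # RPN || page offset
    return (pte.rpn << 12) | (ea & 0xFFF)
```

A return value of None stands for the case in which neither group holds a matching entry.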


If the page table lookup fails, then a page fault occurs and an instruction storage interrupt (for
instruction fetch accesses) or a data storage interrupt (for data accesses) is taken (see “PowerPC
Interrupt Architecture,” later in this appendix).


Example C.1 shows how a page table can be created. A page table entry contains the following
fields:


Valid (V)                         This bit specifies that the page table entry is valid
                                  (this field is set to 1 if this entry is valid).

Virtual Segment ID (VSID)         This field specifies the virtual segment for which
                                  this entry is used.

Hash Function Identifier (H)      This bit specifies whether the primary or secondary
                                  hash function is used to access this entry.

Abbreviated Page Index (API)      This field is matched with bits 4 through 9 of the
                                  effective address (the first 6 bits of the page index).

Real Page Number (RPN)            This field gives the real page number where the
                                  access will go in physical memory.

Reference Bit (R)                 This bit is set by hardware when a page is accessed
                                  using this PTE.

Change Bit (C)                    This bit is set by hardware if any data in this page
                                  has been changed by an access using this entry
                                  during translation.

Write Through, Cache Inhibited,   These bits specify the cache behavior for this
Coherent, Guarded (WIMG)          page, and whether the page is guarded.

Page Protection (PP)              These bits specify the page protection for this
                                  page.

Reserved (RS)                     These bits are reserved and should be set to 0.
Translation Lookaside Buffers


In the preceding section, you learned how the processor accesses the page table when it wants 
to translate an address. If the processor had to access the page table in memory every time it 
wanted to perform an instruction fetch or data access, it would spend most of its time doing 
translation and very little of its time actually running your program. To solve this problem,
all current implementations of the PowerPC architecture contain special caches called
translation lookaside buffers (TLBs). These caches hold recently used PTEs. When the operating
system changes a PTE, it must invalidate the TLB entry corresponding to that PTE
so that the processor does not use the old translation.


Two instructions are supplied for manipulating TLBs. The translation lookaside buffer invalidate
entry (tlbie) instruction invalidates a subset of the TLB based only on the page index. Any
entry that may correspond to that page index is invalidated (the invalidation is done without
accessing the segment registers). The translation lookaside buffer invalidate all (tlbia) instruction
invalidates the entire TLB. Synchronization requirements are associated with these instructions,
which are described in the section “Synchronization Requirements for Segment Registers, Page
Table Entries, and Segment Table Entries.”


64-Bit Virtual Memory Architecture 


The 64-bit PowerPC virtual memory architecture differs primarily in the segment registers.
The formats of the PTEs are different, but the fields and their meanings are the same (see
Figure C.6). The same is true of the 64-bit version of SDR1, except that instead of having the
HTABMASK field, the 64-bit SDR1 contains a field (HTABSIZE) that specifies the number
of 1 bits in the mask, which is used in the same way as the HTABMASK field in the 32-bit
architecture (see Figure C.7).






FIGURE C.6.
128-bit page table entry: 64-bit architecture.

First doubleword:   VSID | API | Reserved | H | V
Second doubleword:  RPN | Reserved | R | C | WIMG | Reserved | PP


FIGURE C.7.
The organization of storage description register 1 (SDR1) and the formation of the page
table entry group address (64-bit architecture).

HTABORG (bits 0-45) | Reserved | HTABSIZE (bits 59-63)

* HTABORG - Real address of the page table.
* HTABSIZE - Encoded size of the page table.
* HASH - The 39-bit hash value.

Page table group physical address generation:

HTABMASK = (28 - HTABSIZE) 0 bits || (HTABSIZE) 1 bits
HTABORG[0-17] || (HTABORG[18-45] | (HTABMASK & HASH[0-27])) || HASH[28-38] || 0000000
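
As a rough illustration of how HTABSIZE takes the place of HTABMASK, the following Python
sketch expands HTABSIZE into a 28-bit mask and then forms a 64-bit group address. It assumes
the concatenation HTABORG[0-17] || (HTABORG[18-45] | (HTABMASK & HASH[0-27])) ||
HASH[28-38] || seven 0 bits, and it assumes that HTABSIZE never exceeds 28 because the mask is
only 28 bits wide. The function names are hypothetical.

```python
def htabmask_from_size(htabsize: int) -> int:
    """Expand HTABSIZE into a 28-bit mask: zeros on the left, htabsize ones on the right."""
    if not 0 <= htabsize <= 28:
        raise ValueError("HTABSIZE is assumed to be between 0 and 28")
    return (1 << htabsize) - 1


def pteg_address_64(htaborg: int, htabsize: int, hash_value: int) -> int:
    """Form the 64-bit physical address of a page table entry group.

    htaborg is the 46-bit HTABORG field and hash_value is a 39-bit hash.
    """
    mask = htabmask_from_size(htabsize)
    hash_hi = (hash_value >> 11) & 0xFFFFFFF   # HASH[0-27]
    hash_lo = hash_value & 0x7FF               # HASH[28-38]
    org_hi = (htaborg >> 28) & 0x3FFFF         # HTABORG[0-17]
    org_lo = htaborg & 0xFFFFFFF               # HTABORG[18-45]
    middle = org_lo | (mask & hash_hi)
    # 18 bits || 28 bits || 11 bits || 7 zero bits = 64 bits
    return (org_hi << 46) | (middle << 18) | (hash_lo << 7)
```

The seven low-order 0 bits make every group address a multiple of 128 bytes, the size of eight
128-bit page table entries.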




Instead of the 16 segment registers in the 32-bit architecture, the 64-bit architecture defines a
one-page segment table (STAB). The STAB is organized into 32 groups of eight segment table
entries (STEs) (see Figure C.8). Each STE has the following fields:


ESID    Effective segment ID.
V       This is set to 1 for valid STEs.
T       This bit identifies direct store segments. The description given here is
        for normal segments (T=0).
Ks      This is the page protection key for operating system code (MSR[PR] = 0).
Kp      This is the page protection key for application code (MSR[PR] = 1).
N       If N=1, then this segment is a no-execute segment.
VSID    This is the virtual segment ID.


FIGURE C.8.
The segment table entry format (normal segments).

ESID (bits 0-35) | Reserved | V | T | Ks | Kp | N | Reserved, followed by the VSID


Accessing the STAB is similar to accessing the page table. The address space register (ASR)
identifies the real address of the STAB (ASR[0-51] is the base address of the STAB, and
ASR[52-63] is reserved). Bits 31 through 35 of the effective address identify the index into the
STAB of the primary segment table entry group (STEG). The logical inverse of these bits forms
the index into the STAB of the secondary STEG. Both the primary and secondary STEGs are
searched for an appropriate STE. A match for an effective address is found if STE[V] = 1 and
STE[ESID] = EA[0-35]. After a matching STE is found, the fields are extracted from the STE
and are used in the same way as they were in the 32-bit architecture.
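
The location of the two groups can be worked out with simple arithmetic. The following Python
sketch assumes a 4-KB page, so that 32 groups of eight entries imply 16-byte STEs and 128-byte
STEGs; those sizes are derived here from the one-page table size rather than quoted from a
manual. The function name is hypothetical.

```python
PAGE_SIZE = 4096
GROUPS = 32
ENTRIES_PER_GROUP = 8

# Derived sizes, assuming the STAB exactly fills one 4-KB page.
STE_SIZE = PAGE_SIZE // (GROUPS * ENTRIES_PER_GROUP)   # 16 bytes
STEG_SIZE = STE_SIZE * ENTRIES_PER_GROUP               # 128 bytes


def steg_addresses(asr: int, ea: int):
    """Return the real addresses of the primary and secondary STEGs for a 64-bit EA."""
    stab_base = asr & ~0xFFF             # ASR[0-51]; the low-order 12 bits are reserved
    esid = (ea >> 28) & 0xFFFFFFFFF      # EA[0-35], the 36-bit effective segment ID
    primary_index = esid & 0x1F          # EA[31-35], the low-order 5 bits of the ESID
    secondary_index = ~primary_index & 0x1F
    return (stab_base + primary_index * STEG_SIZE,
            stab_base + secondary_index * STEG_SIZE)
```

Because the secondary index is the inverse of the primary index, the two groups are never the
same group.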


As with the page table, hardware typically has a segment lookaside buffer (SLB) in
order to speed up translation. The synchronization requirements for software relative to this
structure are described in the section “Synchronization Requirements for Segment Registers,
Page Table Entries, and Segment Table Entries,” later in this appendix.


Two instructions are provided for managing the SLB. These instructions are available only on
64-bit implementations of the PowerPC architecture. The SLB invalidate entry instruction (slbie)
invalidates any entries in the SLB that correspond to the effective address specified by the
instruction. The SLB invalidate all instruction (slbia) invalidates the entire SLB.


Direct Store Segments 


If the T field in a segment register is set to 1, then the segment is a direct store segment. These
segments are meant for use with I/O devices and are an alternative to using memory-mapped
I/O. The format of the segment register for a direct store segment is shown in Figure C.9
(32-bit architecture) and Figure C.10 (64-bit architecture).


T       This bit identifies the segment as a direct store segment.
Ks      This is the page protection key for the operating system code (MSR[PR] = 0).
Kp      This is the page protection key for the application code (MSR[PR] = 1).
BUID    Bus unit ID.


The remaining bits have different meanings for different I/O controllers.


FIGURE C.9.
The segment register format (direct store segments).

T (bit 0) | Ks (bit 1) | Kp (bit 2) | BUID (bits 3-11) | Controller specific (bits 12-31)


FIGURE C.10.
The segment table entry format (direct store segments).

ESID | Reserved | V | T | Ks | Kp | Reserved, followed by controller-specific bits


When translating a direct store address, translation stops after the segment register lookup. The
following information is sent to the controller (identified by the BUID):

■ The page protection key (Ks if MSR[PR] = 0, or Kp if MSR[PR] = 1).
  Note that page protection is not implemented in the processor for direct store
  segments.

■ Some portion of the segment register (see the User’s Manual for the specific
  implementation you are using).

■ Some portion of the effective address (see the User’s Manual for the specific
  implementation you are using).


The following instructions are not supported for direct store segments (the results of these 
instructions for T=1 are boundedly undefined): 

■ eciwx
■ ecowx
■ ldarx
■ lwarx
■ stdcx.
■ stwcx.


Finally, the following instructions have no effect in direct store segments:


■ dcbf
■ dcbi
■ dcbt
■ dcbst
■ dcbtst
■ dcbz
■ icbi





Synchronization Requirements for Segment Registers, Page Table Entries, and Segment
Table Entries


This section describes the synchronization requirements software must follow in order to guar- 
antee correct execution of virtual memory structure manipulation. 


Context Synchronizing Instructions


In order to implement the synchronization requirements that follow, context synchronizing
instructions typically are used. In order to understand context synchronization, remember that
in a typical modern microprocessor, many operations are occurring in parallel; and, unlike in
the strict architectural model, operations may occur out of order. Normally, the processor ensures
that this out-of-order and concurrent execution of instructions is invisible to the programmer;
however, that takes hardware and, in general, slows the execution of instructions. As a
compromise, many of the operating-system specific operations (especially ones that are infrequent)
are required to have software-synchronization support to ensure correct operation.


An operation is context synchronizing if the following conditions are met:


■ Instructions following this operation are performed in the context set by this operation
  (context is the machine state, especially MSR, and virtual memory structures).

■ Instructions preceding this operation are performed in the context prior to this
  operation.


An example of a context-synchronizing instruction is the instruction synchronize instruction
(isync). This instruction simply waits for all instructions preceding it to complete and then
flushes any following instructions that have begun execution out of the processor (requiring
them to be refetched). The isync instruction therefore exactly implements context
synchronization.

There are other instructions and events that are context synchronizing (sc, rfi, and most inter- 
rupts, for example). 
In the following descriptions, lock() and unlock() are subroutines that provide exclusive access
to the structure passed. Therefore, if you want to update a shared object, you issue the follow- 
ing code: 

lock(object) 


This code gives you exclusive access to that object. When you have finished updating that object, 
you issue the following code: 


unlock(object) 


This code releases the object so that another processor or thread can lock it and update it. 


Synchronization for Segment Register Updates 


When updating segment registers, the only synchronization required is that a
context-synchronizing instruction must be placed on each side of the updating instruction. The
following code therefore is a valid way to update segment register 1 (from integer register 0):

        isync
        mtsr    sr1,r0
        isync


Synchronization for Page Table Entry Manipulation 


There are three operations on page tables: 
■ Adding a new PTE
■ Modifying an existing PTE
■ Deleting a PTE


TLBs are not coherent caches of the page table, so software must ensure coherency between 
different TLBs. 


Adding a Page Table Entry 


The following pseudo-code can be used to add a page table entry: 


    lock(PTE)
    PTE[VSID,H,API] <- new values
    PTE[RPN,R,C,WIMG,PP] <- new values
    sync
    PTE[V] <- 0b1
    unlock(PTE)


The sync instruction is required to ensure that the PTE is set up before the valid bit is set. 
Without the sync instruction, an implementation might reorder the storage operations so that 
the valid bit is set before the rest of the PTE is set up, resulting in a window during which an 
invalid translation might occur. 
Modifying a Page Table Entry


The following pseudo-code can be used to modify a page table entry:


    lock(PTE)
    PTE[V] <- 0
    sync
    tlbie(PTE)
    sync
    tlbsync
    sync
    PTE[VSID,H,API] <- new values
    PTE[RPN,R,C,WIMG,PP] <- new values
    sync
    PTE[V] <- 0b1
    unlock(PTE)


The tlbsync instruction ensures that the preceding tlbie has been seen by all processors in a 
system. 


Deleting a Page Table Entry 


The following pseudo-code can be used to delete a page table entry: 


    lock(PTE)
    PTE[V] <- 0
    sync
    tlbie(PTE)
    sync
    tlbsync
    sync
    unlock(PTE)


Synchronization for Segment Table Entry Manipulation 


There are three operations on segment tables: 
■ Adding a new STE
■ Modifying an existing STE
■ Deleting an STE


SLBs are not coherent caches of the segment table, so software must ensure coherency
between different SLBs.


Adding a Segment Table Entry 


The following pseudo-code can be used to add a segment table entry: 


    lock(STE)
    STE[ESID,T,Ks,Kp,N] <- new values
    if T=0 then
        STE[VSID] <- new value
    else
        STE[controller-specific bits] <- new value
    sync
    STE[V] <- 1
    unlock(STE)


Modifying a Segment Table Entry 


The following pseudo-code can be used to modify a segment table entry: 


    lock(STE)
    STE[V] <- 0
    sync
    slbie(STE)
    sync
    STE[ESID,T,Ks,Kp,N] <- new values
    if T=0 then
        STE[VSID] <- new value
    else
        STE[controller-specific bits] <- new value
    sync
    STE[V] <- 1
    unlock(STE)


Deleting a Segment Table Entry 


The following pseudo-code can be used to delete a segment table entry: 


    lock(STE)
    STE[V] <- 0
    sync
    slbie(STE)
    sync
    unlock(STE)


Multiprogramming and Multiprocessor Synchronization


Some means of avoiding race conditions is necessary to implement preemptive multitasking 
and cooperating processes in any computer system. Classically, such a mechanism is provided 
architecturally by an atomic read-write memory operation—an instruction that appears to read 
a value from memory and (perhaps optionally) replace it with a new one. A broad discussion of 
operating system theory and algorithms is beyond the scope of this book; for a complete dis- 
cussion of race conditions, mutual exclusion, and an explanation of the reasons for these in- 
structions, see any good operating systems text. This book was written on the assumption that 
if you’re reading this section, you understand the necessity of synchronization primitives for 
shared memory communications. 


The PowerPC instruction set does not provide any atomic read-write memory instructions per
se. Instead, the architecture provides the capability to implement atomic memory operations as
sequences of PowerPC instructions. The mechanism that enables this process is called a
reservation. The architecture defines two instructions based on the reservation principle: load word
and reserve indexed (lwarx) and store word conditional indexed with record (stwcx.). There are
corresponding double-word instructions, ldarx and stdcx., defined for 64-bit implementations.
Other than the size of the data reference, they are identical, and are not discussed further.


The reservation concept is fairly simple. When a lwarx instruction is executed, the value held
at the memory location is placed in the destination register; additionally, the processor places
a reservation on the block of memory containing the loaded word. If the processor still holds
the reservation and executes a stwcx. instruction to the same location, the store takes place and
the condition code is set (the EQ bit in cr0 is set to 1). Otherwise, the store does not take place
and the EQ bit in cr0 is cleared to 0.


The reservation is held by the processor until one of the following occurs: 


1. The processor executes another lwarx instruction. The first reservation is lost, and a 
new one is made. A processor can hold only one reservation at a time. 


2. The processor executes a stwcx. instruction, regardless of whether the address matches
the current reservation.


3. Another processor executes a store to an address in the same reservation block (see the 
following sidebar). 


4. The processor is interrupted or trapped to supervisor state. Actually, the interrupt 
itself doesn’t release the reservation, but normally, the interrupt handler clears it to 
avoid problems such as one process improperly inheriting a reservation from another. 
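
The four rules above can be captured in a small software model. The following Python sketch is
only a toy simulation written for this discussion: the Memory class, its method names, and the
16-byte reservation block size are all assumptions, and the model fails a store conditional
whose address differs from the reserved one, following the description given here.

```python
class Memory:
    """Toy model of shared memory with one reservation per processor."""

    GRANULE = 16  # hypothetical reservation block size, in bytes

    def __init__(self):
        self.words = {}          # address -> value
        self.reservations = {}   # processor id -> reserved address

    def lwarx(self, cpu, addr):
        """Load a word and reserve it; any earlier reservation is replaced (rule 1)."""
        self.reservations[cpu] = addr
        return self.words.get(addr, 0)

    def stwcx(self, cpu, addr, value):
        """Store conditionally; return True (EQ set) only if the store was performed."""
        held = self.reservations.pop(cpu, None)   # always gives up the reservation (rule 2)
        if held != addr:
            return False
        self._store(cpu, addr, value)
        return True

    def stw(self, cpu, addr, value):
        """An ordinary store."""
        self._store(cpu, addr, value)

    def interrupt(self, cpu):
        """Model an interrupt handler that clears the reservation (rule 4)."""
        self.reservations.pop(cpu, None)

    def _store(self, cpu, addr, value):
        self.words[addr] = value
        block = addr // self.GRANULE
        # A store cancels other processors' reservations in the same block (rule 3).
        for other, held in list(self.reservations.items()):
            if other != cpu and held // self.GRANULE == block:
                del self.reservations[other]


def fetch_and_add(mem, cpu, addr, incr):
    """Retry the load-reserve/store-conditional pair until the store succeeds."""
    while True:
        old = mem.lwarx(cpu, addr)
        if mem.stwcx(cpu, addr, old + incr):
            return old
```

The fetch_and_add() helper has the same shape as the assembly-language primitives in this
section: load and reserve, compute, store conditionally, and loop if the reservation was lost.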


The synchronization algorithms used in multiprogramming systems are defined in terms of a 
particular primitive hardware operation. The reservation concept and the associated PowerPC 
instructions enable fairly straightforward implementation of popular synchronization primi- 
tives and emulation of atomic instructions defined in other machine architectures. 


Following are examples of these coded to be called as C functions. An appropriate ANSI C 
prototype is included in a comment preceding each definition. 


Before using any of these synchronization primitives in your applications, make sure that
you understand the concepts of reservation and reservation granularity (see the following
sidebar).


RESERVATIONS AND RESERVATION GRANULARITY 
Fetch and Nop


This function returns the current value of a variable. Because the function stores the value
back with a store conditional, another process that is in the middle of updating the variable
loses its reservation and thereby can detect that the value was read:


# long Fetch_and_Nop (void *Var);
#
.Fetch_and_Nop:
        lwarx   r4,0,r3         # Load and reserve Var
        stwcx.  r4,0,r3         # Store value back
                                # ...if still reserved
        bne-    .Fetch_and_Nop  # Loop if lost reservation
        mr      r3,r4           # Return current value
        blr
Fetch and Store


This operation fetches the current value of a variable, atomically replacing it with a new one.
The value of the variable before the replacement is returned:


# long Fetch_and_Store (void *Var, long NewVal);
#
.Fetch_and_Store:
        lwarx   r5,0,r3         # Get current value
        stwcx.  r4,0,r3         # Try to replace with NewVal
        bne-    .Fetch_and_Store
                                # Loop if lost reservation
        mr      r3,r5           # Return previous value
        blr


Fetch and Add 


This operation atomically adds to the current value of a variable. The variable’s unincremented 
value is returned: 


# long Fetch_and_Add (void *Var, long Incr);
#
.Fetch_and_Add:
        lwarx   r5,0,r3         # Get current value
        add     r0,r4,r5        # Add Incr to value
        stwcx.  r0,0,r3         # Try to store it back
        bne-    .Fetch_and_Add  # Retry if reservation lost
        mr      r3,r5           # Return previous value
        blr


Some algorithms use a similar primitive, except performing a Boolean operation such as logical
AND or OR on the variable. A Fetch_and_AND or Fetch_and_OR could be implemented
by simply replacing the add instruction with the appropriate logical instruction.


Test and Set 


Test and Set is a commonly occurring primitive in some algorithms, and even appears as an
atomic instruction in some processor architectures. When the primitive returns, the value in
storage is nonzero. If the current value is nonzero, it is returned unchanged. If it is equal to
zero, it is set atomically to the provided nonzero value, and zero is returned. The caller checks
to see whether the returned (previous) value is zero; if so, it is known that the Test_and_Set
routine replaced it with the caller’s value:


# long Test_and_Set (void *Var, long NewVal);
#
.Test_and_Set:
        lwarx   r5,0,r3         # Get old value and reserve
        cmpwi   r5,0            # Old value zero?
        bne-    done            # Branch if no
        stwcx.  r4,0,r3         # Replace zero with NewVal
        bne-    .Test_and_Set   # Retry if reservation lost
done:
        mr      r3,r5           # Return old value
        blr


Compare and Swap 


This primitive is based on the definition of the IBM System/370 Compare and Swap instruc- 
tion. It is similar to Test and Set, except the comparison is with a specified value instead of 
zero. If the current value of the variable is equal to the specified comparison value, it is replaced 
with a second value, and the old value is returned. Otherwise, the variable is unchanged, and 
its current value is returned: 


# long Compare_and_Swap (void *Var, long CmpVal, long NewVal);
#
.Compare_and_Swap:
        lwarx   r6,0,r3         # Get old value and reserve
        cmpw    r4,r6           # old == CmpVal?
        bne-    done            # If not, return current
        stwcx.  r5,0,r3         # If equal, try to replace
        bne-    .Compare_and_Swap
                                # Retry if reservation lost
done:
        mr      r3,r6           # Return previous value
        blr


In all these examples, you code the conditional branch that tests the result of the store
conditional as bne- (the trailing minus sign predicts that the branch is not taken) to indicate
your expectation that the store will succeed.


Lock and Unlock 


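The synchronization sequences shown earlier in this appendix (see “Synchronization
Requirements for Segment Registers, Page Table Entries, and Segment Table Entries”) used a
lock/unlock method of synchronization. Here is an implementation of these primitives using
Test and Set:


# void lock(void *Var);
#
# Var is unlocked if 0, locked if 1
#
.lock:
        mflr    r6              # Save return address (bl overwrites LR)
        mr      r7,r3           # Save Var (Test_and_Set returns in r3)
        li      r4,1            # Value that marks the lock as held
loop:   mr      r3,r7           # Pass Var on every attempt
        bl      .Test_and_Set   # r3 = previous value of the lock
        cmpwi   r3,0            # Was it unlocked?
        bne-    loop            # Loop until unlocked
        isync                   # Wait for prior instr
        mtlr    r6              # Restore return address
        blr                     # return


# void unlock(void *Var);
#
# Var is unlocked if 0, locked if 1
#
.unlock:
        sync                    # Wait for any prior stores
        li      r4,0            # Set lock to zero
        stw     r4,0(r3)        #
        blr                     # return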
PowerPC Interrupt Architecture


The PowerPC exception-handling mechanism enables the processor to change state and react 
to program requests, signals, errors, or external events. Except for certain catastrophic condi- 
tions (like Machine Check or System Reset) PowerPC interrupts are precise, and are handled 
in program order. If two interrupting instructions are executed out of order, for example, the 
first interrupt signaled is the one for the instruction that is earlier in program order, regardless 
of the order in which the instructions actually are executed. 


When an interrupt occurs, the following sequence of events is initiated:


1. The current value of the MSR is copied to SRR1. Some additional bits to indicate the
   type of the interrupt also are placed in SRR1.

2. An instruction address is placed in SRR0. Depending on the type of interrupt, this
   address may be the effective address of the instruction causing the interrupt; the
   instruction that would have been executed next had the interrupt not occurred; or, in
   some cases, an instruction near the point at which the interrupt occurred (in some
   execution modes, some PowerPC interrupts may be imprecise).

3. The MSR is set to disable instruction and data translation, and to enter supervisor
   state.

4. Control is transferred to an interrupt vector location corresponding to interrupt type;
   vector addresses for each interrupt type are shown in Table C.3.

5. Additional information about the interrupt may be present in other PowerPC registers.
   On a data storage exception, for example, the data address register (DAR) contains the
   interrupting address (the DAR can be read with the mfspr instruction). Consult the
   User’s Guide for a specific processor for complete details of interrupt signaling.

The interrupt handler invoked to process an exception should examine the machine state and
determine what action is required. In some cases, the required action can be performed
directly; in other cases, a record of the event is posted to a queue for later processing.
Sometimes, execution of the halted program is resumed at the point of interruption. The
return from interrupt instruction (rfi) provides an operation inverse to an interrupt: it
restores the appropriate MSR bits from SRR1, and branches to the location in SRR0.





Care must be taken to ensure that the state of one process is not incorrectly carried across an 
interrupt-handling boundary. There are two principal concerns: instructions and references 
executed out of order, and memory reservations. isync may be required in an interrupt handler 
when it changes the MSR or SRs for turning on DR or IR, for example. Also, a held memory 
reservation is not cleared by an exception; to prevent incorrect inheritance of a reservation across
the rfi, the handler may need to execute an stwcx. (or stdcx.) instruction to clear any pending 
reservations. 

Table C.3. PowerPC interrupt vector locations.


Vector Offset (hex)    Interrupt Type

00000                  Reserved
00100                  System Reset
00200                  Machine Check
00300                  Data Storage
00400                  Instruction Storage
00500                  External
00600                  Alignment
00700                  Program
00800                  Floating-Point Unavailable
00900                  Decrementer
00A00                  Reserved
00B00                  Reserved
00C00                  System Call
00D00                  Trace
00E00                  Floating-Point Assist
00E10-00FFF            Reserved
01000-02FFF            Reserved, implementation-specific
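

The interrupt sequence and the rfi instruction described in this section can be modeled in a few lines of C. This is a deliberately simplified sketch, not a description of any particular processor: the struct and function names are hypothetical, the MSR masks assume the bit layout of 32-bit PowerPC implementations, the interrupt-type bits placed in SRR1 are omitted, the vector base is assumed to be zero, and rfi is modeled as restoring the whole MSR rather than only "the appropriate" bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* MSR masks (assumed: 32-bit PowerPC layout). */
#define MSR_PR 0x00004000u   /* problem (user) state       */
#define MSR_IR 0x00000020u   /* instruction translation on */
#define MSR_DR 0x00000010u   /* data translation on        */

/* A few vector offsets taken from Table C.3. */
enum vector {
    VEC_SYSTEM_RESET = 0x00100,
    VEC_DATA_STORAGE = 0x00300,
    VEC_EXTERNAL     = 0x00500,
    VEC_SYSTEM_CALL  = 0x00C00
};

/* Hypothetical model of the registers involved. */
struct cpu_model {
    uint32_t pc, msr, srr0, srr1;
};

void take_interrupt(struct cpu_model *c, enum vector v, uint32_t resume)
{
    assert(c != NULL);
    c->srr1 = c->msr;                        /* step 1: save MSR         */
    c->srr0 = resume;                        /* step 2: save an address  */
    c->msr &= ~(MSR_PR | MSR_IR | MSR_DR);   /* step 3: supervisor state,
                                                translation off          */
    c->pc = (uint32_t)v;                     /* step 4: go to the vector */
}

void return_from_interrupt(struct cpu_model *c)
{
    assert(c != NULL);
    c->msr = c->srr1;   /* simplification: all bits restored */
    c->pc  = c->srr0;
}
```

Because take_interrupt overwrites SRR0 and SRR1, a handler that could itself be interrupted must save both registers first; the model makes that hazard easy to see.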





A Detailed Floating-Point 
Model 
This appendix is supplied primarily for those who want to optimize floating-point code. A basic
introduction to floating point is given in Chapter 2, “Introduction to the PowerPC
Architecture,” and this is probably sufficient for most programmers. If you need to know the
exact behavior of floating-point code (to avoid pathological rounding errors, for example) or
how to control the exception behavior of floating-point code, however, then this appendix
provides those details.


Included in this appendix are the algorithms used for rounding and conversion between integer
and floating-point values. First we will describe the floating-point status and control
register in more detail than in previous chapters, including the exception status and control
mechanism.


Floating-Point Status and Control Register 


The floating-point status and control register (FPSCR) is used by software to control certain 
floating-point operations, and by hardware to report certain conditions that arise during 
floating-point execution. There are 32 bits in the FPSCR (see Figure D.1). The bits of the 
FPSCR are described below (see Table D.1). The exception bits are described in the “Floating- 
Point Exceptions” section. 

Table D.1. Floating-point status and control register bit definitions.


Bit(s)  Name of field                       Meaning of field

0       Floating-Point Exception            The processor sets this bit to one if any of the
        Summary (FX)                        floating-point exception bits in the FPSCR change from
                                            zero to one during the execution of an instruction,
                                            regardless of the corresponding enable bit for that
                                            exception.

1       Floating-Point Enabled              This bit indicates whether any enabled exceptions have
        Exception Summary (FEX)             occurred. It is set by ORing all the floating-point
                                            exception bits in the FPSCR (see below) masked by their
                                            enable bits (also in the FPSCR; see below).

2       Floating-Point Invalid Operation    This bit indicates whether any invalid operation
        Exception Summary (VX)              exceptions have occurred. It is set by ORing all the
                                            invalid operation exception bits in the FPSCR (see
                                            below).

3       Floating-Point Overflow             This indicates whether an overflow exception occurred.
        Exception (OX)

4       Floating-Point Underflow            This indicates whether an underflow exception occurred.
        Exception (UX)

5       Floating-Point Zero Divide          This indicates whether a divide by zero exception
        Exception (ZX)                      occurred.

6       Floating-Point Inexact              This indicates whether an inexact result exception has
        Exception (XX)                      occurred. This is a 'sticky' version of the FPSCR_FI
                                            bit (see below), meaning that instructions set it to
                                            one implicitly, but once set, it stays set until
                                            explicitly reset.

7       Floating-Point Invalid Operation    This bit is set if a floating-point invalid operation
        Exception (SNaN) (VXSNAN)           exception occurs because an attempt was made to operate
                                            on a signaling NaN.

8       Floating-Point Invalid Operation    This bit is set if a floating-point invalid operation
        Exception (∞ - ∞) (VXISI)           exception occurs because an attempt was made to perform
                                            a magnitude subtraction of infinities.

9       Floating-Point Invalid Operation    This bit is set if a floating-point invalid operation
        Exception (∞/∞) (VXIDI)             exception occurs because an attempt was made to divide
                                            infinities.

10      Floating-Point Invalid Operation    This bit is set if a floating-point invalid operation
        Exception (0/0) (VXZDZ)             exception occurs because an attempt was made to divide
                                            zero by zero.

11      Floating-Point Invalid Operation    This bit is set if a floating-point invalid operation
        Exception (∞ × 0) (VXIMZ)           exception occurs because an attempt was made to
                                            multiply an infinity by zero.

12      Floating-Point Invalid Operation    This bit is set if a floating-point invalid operation
        Exception (invalid compare)         exception occurs because an attempt was made to make an
        (VXVC)                              ordered comparison involving a NaN.

13      Floating-Point Fraction Rounded     This bit indicates whether the last arithmetic or
        (FR)                                conversion/rounding instruction to execute that rounded
                                            the intermediate result incremented the fraction
                                            (rounded away from zero).

14      Floating-Point Fraction Inexact     This bit indicates whether the last arithmetic or
        (FI)                                conversion/rounding instruction to execute rounded the
                                            intermediate result or caused a disabled overflow
                                            exception (see FPSCR_OE below).

15-19   Floating-Point Result Flags         This is a 5-bit field consisting of two subfields: the
        (FPRF)                              Floating-Point Result Class Descriptor (C) and the
                                            Floating-Point Condition Code (FPCC). These bits are
                                            set as shown below.

                                            C   FPCC     Type of value for the result
                                            1   0b0001   Quiet NaN
                                            0   0b1001   Negative Infinity
                                            0   0b1000   Negative Normal Number
                                            1   0b1000   Negative Denormalized Number
                                            1   0b0010   Negative Zero
                                            0   0b0010   Positive Zero
                                            1   0b0100   Positive Denormalized Number
                                            0   0b0100   Positive Normal Number
                                            0   0b0101   Positive Infinity

20      Reserved

21      Floating-Point Invalid Operation    This bit is set explicitly by software in order to
        Exception (Software Request)        request an invalid operation exception to occur.
        (VXSOFT)

22      Floating-Point Invalid Operation    This bit is set if a floating-point invalid operation
        Exception (Invalid Square Root)     exception occurs because an attempt was made to perform
        (VXSQRT)                            a square root operation on a negative non-zero number.

23      Floating-Point Invalid Operation    This bit is set if a floating-point invalid operation
        Exception (Invalid Integer          exception occurs because an attempt was made to convert
        Convert) (VXCVI)                    a NaN or a large number to an integer (large meaning
                                            outside the representable range for the type of integer
                                            requested).

24      Floating-Point Invalid Operation    Enables or disables the invalid operation exception
        Exception Enable (VE)               class.

25      Floating-Point Overflow             Enables or disables the overflow exception class.
        Exception Enable (OE)

26      Floating-Point Underflow            Enables or disables the underflow exception class.
        Exception Enable (UE)

27      Floating-Point Zero Divide          Enables or disables the zero divide exception class.
        Exception Enable (ZE)

28      Floating-Point Inexact Exception    Enables or disables the inexact exception class.
        Enable (XE)

29      Floating-Point Non-IEEE Mode        If this bit is set to one, the meanings of all the
        (NI)                                other FPSCR bits may be different from how they are
                                            described here. In addition, the results of
                                            floating-point operations may differ from the IEEE
                                            defined results. If the result of a floating-point
                                            operation normally would be a denormalized number, then
                                            the result is set to zero. In general, if this bit is
                                            set to one, the operation of the floating-point unit is
                                            implementation-specific, and the User's Manual for the
                                            particular implementation of interest should be
                                            consulted. Note that different implementations may
                                            behave differently when this bit is set to one.

30-31   Floating-Point Rounding             This is a 2-bit field that determines the rounding mode
        Control (RN)                        used by the processor. The rounding modes are shown
                                            below.

                                            FPSCR_RN   Rounding Mode
                                            0b00       Round to nearest
                                            0b01       Round toward zero
                                            0b10       Round toward positive infinity
                                            0b11       Round toward negative infinity


Floating-Point Exceptions


Several floating-point exceptions are defined by the PowerPC architecture:

* Invalid operation exception:
      SNaN
      Infinity minus infinity
      Infinity divided by infinity
      Zero divided by zero
      Invalid compare
      Software request
      Invalid square root
      Invalid integer convert

* Zero divide exception

* Overflow exception

* Underflow exception

* Inexact exception


These exceptions may occur during computational operations, or if the FPSCR_VXSOFT bit is set
by software (for the software request invalid operation exception). Each of these exceptions has
a bit in the FPSCR that is used to record an occurrence of that exception (including each type 
of invalid operation exception). In addition, each of these five exception classes has an enable 
bit in the FPSCR in order to enable or disable that class of exception. Finally, there is a sticky 
bit for each class of exception that records the occurrence of that class of exceptions (the sticky 
bits must be reset by software). 
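
The relationship between the individual exception bits, their enable bits, and the two summary bits (VX and FEX) can be made concrete with a short C sketch. The bit positions below are the ones defined by the PowerPC architecture for the 32-bit FPSCR (bit 0 is the most significant bit); the macro and function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* PowerPC numbers bits from the most significant end. */
#define FPSCR_BIT(n) (UINT32_C(1) << (31 - (n)))

enum fpscr_bit {
    FPSCR_FX = 0, FPSCR_FEX = 1, FPSCR_VX = 2, FPSCR_OX = 3,
    FPSCR_UX = 4, FPSCR_ZX = 5, FPSCR_XX = 6,
    FPSCR_VXSNAN = 7, FPSCR_VXISI = 8, FPSCR_VXIDI = 9,
    FPSCR_VXZDZ = 10, FPSCR_VXIMZ = 11, FPSCR_VXVC = 12,
    FPSCR_VXSOFT = 21, FPSCR_VXSQRT = 22, FPSCR_VXCVI = 23,
    FPSCR_VE = 24, FPSCR_OE = 25, FPSCR_UE = 26,
    FPSCR_ZE = 27, FPSCR_XE = 28
};

static int bit_is_set(uint32_t fpscr, int n)
{
    return (fpscr & FPSCR_BIT(n)) != 0;
}

/* Recompute the VX and FEX summary bits from the other bits. */
uint32_t fpscr_update_summaries(uint32_t fpscr)
{
    const uint32_t vx_sources =
        FPSCR_BIT(FPSCR_VXSNAN) | FPSCR_BIT(FPSCR_VXISI) |
        FPSCR_BIT(FPSCR_VXIDI)  | FPSCR_BIT(FPSCR_VXZDZ) |
        FPSCR_BIT(FPSCR_VXIMZ)  | FPSCR_BIT(FPSCR_VXVC)  |
        FPSCR_BIT(FPSCR_VXSOFT) | FPSCR_BIT(FPSCR_VXSQRT) |
        FPSCR_BIT(FPSCR_VXCVI);

    /* VX is the OR of every invalid operation exception bit. */
    int vx = (fpscr & vx_sources) != 0;

    /* FEX is the OR of each exception class masked by its enable bit. */
    int fex = (vx && bit_is_set(fpscr, FPSCR_VE)) ||
              (bit_is_set(fpscr, FPSCR_OX) && bit_is_set(fpscr, FPSCR_OE)) ||
              (bit_is_set(fpscr, FPSCR_UX) && bit_is_set(fpscr, FPSCR_UE)) ||
              (bit_is_set(fpscr, FPSCR_ZX) && bit_is_set(fpscr, FPSCR_ZE)) ||
              (bit_is_set(fpscr, FPSCR_XX) && bit_is_set(fpscr, FPSCR_XE));

    fpscr &= ~(FPSCR_BIT(FPSCR_VX) | FPSCR_BIT(FPSCR_FEX));
    if (vx)  fpscr |= FPSCR_BIT(FPSCR_VX);
    if (fex) fpscr |= FPSCR_BIT(FPSCR_FEX);
    return fpscr;
}
```

On real hardware the processor maintains these summary bits itself; the function only shows how they follow from the other fields.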


When an exception occurs, the result of the operation may be suppressed (the instruction does 
not update a general floating-point register). The cases when the result is suppressed are en- 
abled invalid operation exceptions, and enabled zero divide exceptions. In all other cases, the 
result is written to the floating-point register (possibly destroying one of the operands if the 
target register also is an operand register for that instruction). 


In addition to the enable bits in the FPSCR, there are two bits in the machine state register 
(MSR) that define the action the processor takes when a floating-point exception occurs (see 
Table D.2 and Appendix C, “Operating System Design for PowerPC Processors”). 


Table D.2. Definitions of the floating-point exception enable bits in the MSR.


MSR_FE0  MSR_FE1  Mode                      Description

0        0        Ignore exceptions mode    In this mode, the system floating-point exception
                                            handler is not invoked when a floating-point exception
                                            occurs.

0        1        Imprecise nonrecoverable  In this mode, the system floating-point exception
                  mode                      handler is invoked when an enabled floating-point
                                            exception occurs. The handler is invoked at some point
                                            after the excepting instruction has executed, and the
                                            results of that execution may have been used by a
                                            subsequent instruction before the handler is invoked.

1        0        Imprecise recoverable     In this mode, the system floating-point exception
                  mode                      handler is invoked when an enabled floating-point
                                            exception occurs. The handler is invoked at some point
                                            after the excepting instruction has executed, but the
                                            results of that execution will not have been used by a
                                            subsequent instruction before the handler is invoked.

1        1        Precise mode              In this mode, the system floating-point exception
                                            handler is invoked when an enabled floating-point
                                            exception occurs. The handler is invoked precisely when
                                            the excepting instruction executes.


In general, enabling floating-point exceptions is useful for debugging code, but if possible, 
exceptions should be disabled (in both the FPSCR and MSR) for best performance. 
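

As a small illustration of Table D.2, the following C sketch decodes the two MSR bits into one of the four modes. The mask values assume the MSR layout of 32-bit PowerPC implementations (FE0 at bit 20 and FE1 at bit 23, counting from the most significant bit), and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_FE0 0x00000800u   /* assumed: MSR bit 20 */
#define MSR_FE1 0x00000100u   /* assumed: MSR bit 23 */

/* Enumerator values equal the two-bit number formed by FE0,FE1. */
enum fp_exception_mode {
    FP_MODE_IGNORE                   = 0,   /* FE0=0, FE1=0 */
    FP_MODE_IMPRECISE_NONRECOVERABLE = 1,   /* FE0=0, FE1=1 */
    FP_MODE_IMPRECISE_RECOVERABLE    = 2,   /* FE0=1, FE1=0 */
    FP_MODE_PRECISE                  = 3    /* FE0=1, FE1=1 */
};

enum fp_exception_mode fp_exception_mode_of(uint32_t msr)
{
    int fe0 = (msr & MSR_FE0) != 0;
    int fe1 = (msr & MSR_FE1) != 0;
    return (enum fp_exception_mode)((fe0 << 1) | fe1);
}
```

An operating system might use a helper like this when deciding whether a faulting floating-point instruction can be restarted.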


Floating-Point Models 


This section contains suggested models for floating-point operation. Some models are presented 
as algorithms written in pseudo-code. 


IEEE Floating-Point Model


This section describes the IEEE double-precision floating-point model. The IEEE single- 
precision model is analogous to the double-precision model. The magnitude of the fraction 
portion of an IEEE double-precision floating-point value is 53 bits long. A sign bit is used to 
specify the sign of the number (this is called signed-magnitude representation). The format used 
for the accumulator that performs the IEEE arithmetic includes several bits in addition to these 
(see Figure D.2). 














FIGURE D.2. IEEE floating-point arithmetic.

* S: The sign bit.

* C: The carry bit. This bit receives the carry out from the significand. It
  is used during result normalization.

* L: The leading bit. The most significant bit of the significand. This
  receives the implicit bit from the operands.

* Fraction: The fractional portion of the significand (the mantissa).

* G: The guard bit. This is an extension of the low order bits of the
  significand used for normalization and rounding.

* R: The round bit. This is an extension of the low order bits of the
  significand used for normalization and rounding.

* X: The sticky bit. This is an extension of the low order bits of the
  significand used for rounding.


The guard and round bits are used during normalization of the result. They are shifted left 
during normalization, with zeros being shifted into the round bit. The carry bit also may be 
used during normalization. If the carry bit is set, in order to normalize the result, it must be 
shifted right by 1 bit (so the carry is shifted into the leading bit). The sticky bit is an OR of all 
bits that are less significant than the round bit. It is not used during normalization. The G, R, 
and X bits are used during rounding (see Table D.3). 


Table D.3. The G, R, and X bits.


G  R  X   Interpretation

0  0  0   The intermediate result is exact.

0  0  1   The intermediate result is closer to the next lower
0  1  0   representable number.
0  1  1

1  0  0   The intermediate result is exactly halfway between
          the two closest representable numbers.

1  0  1   The intermediate result is closer to the next higher
1  1  0   representable number.
1  1  1


When the intermediate result is rounded, it is incremented by 1 (rounded up) or not (rounded
down). The pseudo-code shown in Model D.1 determines whether the significand is
incremented.


MODEL D.1.
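

To illustrate the increment decision, here is a minimal C sketch. It is not the book's Model D.1; it is the usual IEEE 754 rule written out under stated assumptions: the four rounding modes use the FPSCR_RN encodings (0 nearest, 1 toward zero, 2 toward positive infinity, 3 toward negative infinity), "nearest" breaks ties toward an even low-order fraction bit, and the function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum round_mode {
    RN_NEAREST = 0,
    RN_ZERO    = 1,
    RN_POS_INF = 2,
    RN_NEG_INF = 3
};

/* Decide whether the significand magnitude is incremented.
 * sign: true if the result is negative
 * lsb:  low-order bit of the fraction before rounding
 * g, r, x: guard, round, and sticky bits */
bool round_increments(enum round_mode rn, bool sign, bool lsb,
                      bool g, bool r, bool x)
{
    bool inexact = g || r || x;

    switch (rn) {
    case RN_NEAREST:
        /* Above halfway: increment. Exactly halfway (G=1, R=X=0):
         * increment only if that makes the low-order bit even. */
        return g && (r || x || lsb);
    case RN_ZERO:
        return false;              /* always truncate */
    case RN_POS_INF:
        return inexact && !sign;   /* positive values grow */
    case RN_NEG_INF:
        return inexact && sign;    /* negative values grow in magnitude */
    }
    return false;
}
```

The three groups of rows in Table D.3 correspond to the cases G=0 (below halfway), G=1 with R=X=0 (exactly halfway), and G=1 with R or X set (above halfway).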





Floating-Point Multiply-Add Model 


The PowerPC architecture defines an instruction form that performs up to three operations in 
a single cycle: a multiply, an add, and a negate. The effect of this special form on computa- 
tional results is a more exact intermediate result for the combined operations than would be 
obtained by performing separate operations. The intermediate result (the result of the multi- 
ply portion of the combined operation) can have as many as 106 significant bits (the product 
of two 53-bit significands). All 106 bits of this intermediate result are fed into the add portion 
of the combined operation. The accumulator used for the PowerPC multiply-add model is 
shown in Figure D.3. 

FIGURE D.3. Multiply-add arithmetic.

* C: The carry bit. This bit receives the carry out from the significand. It
  is used during result normalization.

* L: The leading bit. The most significant bit of the significand. This
  receives the implicit bit from the operands.

* Fraction: The fractional portion of the significand (the mantissa).

* G: The guard bit. This is an extension of the low order bits of the
  significand used for normalization and rounding.

* R: The round bit. This is an extension of the low order bits of the
  significand used for normalization and rounding.

* X: The sticky bit. This is an extension of the low order bits of the
  significand used for rounding.
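

The figure of 106 significant bits for the intermediate product can be checked with a short C sketch that multiplies two 53-bit significands as integers. This is an arithmetic illustration only, not a model of the hardware accumulator; the struct and function names are hypothetical, and the 128-bit product is assembled from 32-bit halves so that only standard C is needed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit unsigned value held as two 64-bit halves. */
struct u128 {
    uint64_t hi, lo;
};

/* Full 64 x 64 -> 128 bit multiply using 32-bit partial products. */
struct u128 mul_64x64(uint64_t a, uint64_t b)
{
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;

    /* Sum of the middle column; it cannot overflow 64 bits. */
    uint64_t mid = (p0 >> 32) + (p1 & 0xFFFFFFFFu) + (p2 & 0xFFFFFFFFu);

    struct u128 r;
    r.lo = (mid << 32) | (p0 & 0xFFFFFFFFu);
    r.hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return r;
}

/* Number of bits needed to represent v (0 for v == 0). */
int significant_bits(struct u128 v)
{
    uint64_t w = v.hi ? v.hi : v.lo;
    int n = 0;
    while (w != 0) {
        n++;
        w >>= 1;
    }
    return v.hi ? n + 64 : n;
}
```

The largest 53-bit significand, 2^53 - 1, squared needs exactly 106 bits, while the smallest normalized significand, 2^52, squared needs 105; a separate multiply would round this product back to 53 bits before the add.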


Floating-Point Round to Single-Precision Model 


The floating point round to single-precision instruction has the following format: 
frsp[.] FRT, FRB 


The pseudo-code shown in Model D.2 implements the frsp instruction. 


Floating-Point Convert-to-Integer Model


The floating-point convert-to-integer instructions have the following form:
convert[.] FRT, FRB


The operand in floating-point register FRB is converted to an integer and placed into
floating-point register FRT. The pseudo-code shown in Model D.3 implements the PowerPC
convert-to-integer model.


MODEL D.3.
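

As a rough illustration of the convert-to-integer behavior (not a reproduction of Model D.3), the C sketch below converts a double to a 32-bit integer with round toward zero. The assumptions are stated plainly: it covers only the word-sized, round-toward-zero case; a NaN or out-of-range operand sets an "invalid" flag standing in for FPSCR_VXCVI; and the saturated results follow the PowerPC architecture's defined values (most negative integer for NaN and negative overflow, most positive integer for positive overflow). The function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Convert x to a 32-bit signed integer, rounding toward zero.
 * *invalid is set when the operand is a NaN or too large, the
 * condition the text associates with the VXCVI bit. */
int32_t convert_to_int32_rz(double x, bool *invalid)
{
    assert(invalid != NULL);
    *invalid = false;

    if (x != x) {                       /* NaN compares unequal to itself */
        *invalid = true;
        return INT32_MIN;
    }
    if (x >= 2147483648.0) {            /* 2^31 and above */
        *invalid = true;
        return INT32_MAX;
    }
    if (x <= -2147483649.0) {           /* truncates below -2^31 */
        *invalid = true;
        return INT32_MIN;
    }
    /* In range after truncation, so the cast is well defined. */
    return (int32_t)x;
}
```

The range checks come before the cast because converting an out-of-range double to an integer type is undefined behavior in C, whereas the hardware instruction always produces a defined result.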


Floating-Point Convert-from-Integer Model


The floating-point convert-from-integer instruction has the following form:
fcfid[.] FRT, FRB


The contents of FRB are treated as a 64-bit signed integer, which is converted to a
floating-point number and stored in FRT. The pseudo-code shown in Model D.4 implements the
fcfid instruction.


MODEL D.4.
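

Because a 64-bit integer can carry more significant bits than the 53-bit significand of a double, the conversion sometimes has to round. The C sketch below (an illustration only, not Model D.4; the function name is hypothetical) tests whether a given 64-bit signed integer converts without rounding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if v is exactly representable as an IEEE double, that is,
 * if its magnitude needs at most 53 significant bits once the
 * trailing zero bits are set aside. */
bool int64_converts_exactly(int64_t v)
{
    /* Magnitude computed in unsigned arithmetic so that the most
     * negative value is handled without overflow. */
    uint64_t m = (v < 0) ? (uint64_t)0 - (uint64_t)v : (uint64_t)v;

    if (m == 0)
        return true;
    while ((m & 1u) == 0)      /* drop trailing zeros */
        m >>= 1;
    return m < (UINT64_C(1) << 53);
}
```

When this function returns false, the conversion result depends on the current rounding mode, and the inexact condition is the one the FPSCR reports through the FI and XX bits.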











Obviously, portions of code developed in assembly language won't run directly on another
platform except in emulation, and emulation is not usually a good way to achieve high
performance. However, in some cases, the hardware underpinnings can have implications even for
high-level language code. Since most software developers strive for platform independence, it's
important to understand key differences between processor, system, and software architectures
to avoid unnecessary stumbling blocks to portability. To this end, we'll try to outline some of
the key compatibility concerns among PowerPC machines and systems based on other
architectures.


Motorola 680x0 and Intel 80x86 Processors 


Many of the potential problems in developing code for Apple Macintosh machines based on 
the Motorola 680x0 processors and PowerPC processors have been solved very neatly by Apple 
and ISVs developing for the Macintosh platform. Apple provides complete emulation for the 
680x0 processor on Power Macs, and virtually all applications written for 680x0 Macs run with 
good performance without modification on PowerPC Macs. The Mac OS provides ways of 
packaging applications with native code for both platforms (fat binaries), making it easy to
maintain one set of application binaries, even on a network server used by Macs of both types.


From a coding standpoint, there are a few compatibility issues that bear mention. First are the
stricter alignment rules for best performance on PowerPC processors; second is the choice of
floating-point representation. Better floating-point performance can be achieved with native
PowerPC floating point than with Apple’s SANE library, which is supported only in emulation
on PowerPC Macintoshes.


The Intel 80x86 architecture is used in the IBM PC and compatibles. Aside from the instruc- 
tion set, a significant difference between PC-compatibles and Macintoshes is the byte ordering 
in memory (see “Big Endian Versus Little Endian Byte Ordering” later in this chapter). The 
PowerPC processors support little endian memory access mode, so it would be possible to build 
an emulation environment like the Power Macintosh for the IBM PC, but there is not
currently such a system available. The Microsoft Windows NT operating system depends on
little-endian operation, and this OS has been ported to the PowerPC platform.


Like the Apple machines, some IBM PC applications use non-IEEE floating-point arithmetic, 
and converting applications to run with native PowerPC floating point may require code 
changes. 


IBM POWER and POWER2 


The easiest migration path to PowerPC processors is from the IBM RS/6000 workstations. 
These machines are based on the POWER or POWER2 architecture that served as the basis 
for the PowerPC architecture. There are some instruction set differences; the MQ register and 
associated instructions were eliminated from PowerPC in favor of more general multiply and 
divide instructions, and the lscbx string instruction was dropped in PowerPC. The POWER
instructions are all supported in the PowerPC 601 for compatibility but don’t appear in the
other implementations. The POWER2 architecture has some instructions like the quadword 
floating-point instructions that allow loading and storing of two floating point registers at a 
time (which takes advantage of POWER2’s 128-bit cache interface). 


All of the POWER and POWER2 instructions will be supported through software emulation 
under IBM’s AIX operating system, but best performance can be obtained by avoiding these 
instructions in code for PowerPC systems. Compilers can be conditioned to generate “inter- 
section” code that only uses instructions that work on all RS/6000 platforms. 


PowerPC 601, 603, 604, and 620 Processors 


All PowerPC processor implementations run the same applications programs, and from a strictly 
functional point of view, it’s not really necessary to worry about compatibility. However, there 
are some differences in these implementations that may require changes to system manage- 
ment code in the operating system. Also, the approach taken to implement the PowerPC 
architecture for each may result in differences in the ways to achieve best performance. Some- 
times this is by design, and sometimes due to constraints of the implementation, such as the 
amount of silicon available for processing resources. For example, the 601 has a unified cache 
and the 604 has split instruction and data caches. Depending on the resource demands of a 
particular program, one design approach might yield better performance than another. 


It would be impossible to cover in detail all of the ways in which the PowerPC processor
implementations differ, let alone explore all of the subtle ways in which performance could be
affected, but Table E.1 summarizes some of the important ones.


Table E.1. PowerPC implementation differences.


                              601            603            604               620

Architecture                  32-bit         32-bit         32-bit            64-bit
Instructions Issued/Cycle     3              3              4                 4
Instructions Fetched/Cycle    8              2              4                 4
Branch Prediction             Static         Static         Dynamic           Dynamic
                                                            (512-entry BHT)   (512-entry BHT)
On-chip Cache                 32K Unified    8K/8K          16K/16K           32K/32K
(Instruction/Data)
Cache Associativity           8-way          2-way/2-way    4-way/4-way       8-way/8-way
Data Bus Width                64-bit         32/64-bit      64-bit            128-bit


Big Endian Versus Little Endian Byte Ordering


The obstacle to compatibility among different system architectures is the order in which the 
bytes of a register are stored in memory when more than one byte is stored at a time. There are 
two conventions for byte ordering used in computer systems today, and these are commonly 
known as big endian and little endian. If a 4-byte word is stored at a memory location, loading 
the individual bytes of that word, one at a time, yields different results depending on the memory 
ordering convention used by that system. 


To illustrate, suppose a general register contains the hexadecimal value 12345678. The
processor issues a store instruction that writes the contents of the register to a word of
storage beginning at memory address 100. Then a load instruction to read the byte (rather than the word) at
location 100 is issued. What value is returned? In a system using big endian byte order, the 
value is hexadecimal 12. If the ordering is little endian, the hex value is 78. Figure E.1 
shows why. 


FIGURE E.1. Big endian versus little endian byte ordering.

register value:   | 12 | 34 | 56 | 78 |

when stored at memory location 100:

            Big Endian                 Little Endian

address     100   101   102   103      100   101   102   103
contents     12    34    56    78       78    56    34    12


Most of the time, the byte ordering convention is transparent. The order used by a particular 
machine can only be determined if memory words can be accessed as both words and bytes, or 
when trying to share data between machines using opposite orderings. Historically, IBM (ex- 
cept for the PC) and Motorola processors have been big endian, while DEC and Intel proces- 
sors (used in the IBM PC) have been little endian. This creates a dilemma, since the two most 
popular personal computer architectures, the IBM PC and the Apple Macintosh, use different 
byte ordering. This makes it difficult to share data between the two machines, and challenging 
to build one machine that can run code for both with ease. 


Several features of PowerPC processors permit them to use either mode of memory access. A 
bit in a special register (in HID0 for the 601, in the MSR for the others) selects the processor's
memory access mode. It is possible to address memory opposite the current mode by using the 
byte-reversed load and store instructions (lhbrx, lwbrx, sthbrx, and stwbrx). These instructions 
allow, for example, little endian storage access while running in big endian mode with no per- 
formance penalty. 
This is important because the current family of PowerPC processors can yield better
performance running in big endian mode. There is better hardware handling of misaligned big endian
accesses—misaligned references and the multiple-register and string load and store instructions 
are trapped to software emulation in little endian mode. 
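

The byte ordering example and the effect of a byte-reversed access can be reproduced in portable C. The sketch below is only an illustration: the function names are hypothetical, and the explicit shifts make the code behave the same on a machine of either byte order, which is one way of keeping C code insensitive to endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value with the most significant byte first. */
void store_be32(uint8_t *p, uint32_t v)
{
    assert(p != NULL);
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Store a 32-bit value with the least significant byte first. */
void store_le32(uint8_t *p, uint32_t v)
{
    assert(p != NULL);
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Reverse the four bytes of a word, the transformation that a
 * byte-reversed load or store applies to the data. */
uint32_t byte_reverse32(uint32_t v)
{
    return (v >> 24) |
           ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) |
           (v << 24);
}
```

With the value 12345678 (hex) from the example, the byte at the lowest address is 12 after store_be32 and 78 after store_le32, matching Figure E.1.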


Many artifacts of endianness can be handled transparently by a compiler. When coding in as- 
sembler, it can be more difficult, but it is still possible in many cases to keep code insensitive to 
endianness by always manipulating memory in the natural mode width of the scalar data items. 
Endian-neutral development is discussed in some detail in a two-part article by James Gillig in
Dr. Dobb’s Journal. It is highly recommended for developers concerned with compatibility (see
Appendix G).


Quick References 


Instruction Set
Table F.1. The PowerPC instruction set. 
Mnemonic Arguments Description Side Effects 
add[o][.] RT,RA,RB add [XER,,.XER,y] 
[CR] 
addc[o]|[.] RT,RA,RB add carrying XER._., [XER,, 
XER,y] [CR,] 
adde[o]|[.] RT,RA,RB add extended [XER,.XER,,,] 
[CR,] 
addi RT,Ra’®,SI add immediate 
addic[.] RT,RA,SI add immediate carrying XER.., [CR,] 
addis RT,Ra?®,SIS add immediate shifted 
addme[o]|.] RT,RA add to minus one [XER,,.XER,\] 
extended [CR,] 
addze[o][.] RT,RA add to zero extended [XER,,XER,,,] 
[CR|] 
and[.] RT,RA,RB and [CR] 
andc[.] RT,RA,RB and with compliment [CR,] 
: (RA RB) 
andi. RT,RA,UI and immediate [CR,] 
andis. RT,RA,UIS and immediate shifted [CR,] 
b{l] [a] SIY¥ branch [LR] 
be[I] [a] BO,BI,SI” branch conditional [LR] 
bectr[]] BO,BI branch conditional to [LR] 
count register 
belr{I] BO,BI branch conditional to [LR] 
link register 
clridi[.]* RT,RA,N clear left doubleword [CR] 
immediate 
clrisldi[.]* RT,RA,MB,N clear left and shift left [CR] 
doubleword immediate 
clrislwi[.]* RT,RA,MB,N clear left and shift left [CR,] 


word immediate 


clrlwi[.]* RT,RA,MB clear left word immediate [CR,] 





Mnemonic 


clrrdi[.]* 


clrrwi[.]* 
cmp 
cmpd* 
cmpdi* 
cmpld* 
cmpldi* 


cmpi 
cmpl 
cmpli 
cmpw* 
cmpwi* 
cmplw* 


cmplwi* 


cntlzd[.] 
cntlzwI.| 
crand 


crandc 


crclr* 


creqv 


crmove* 
crnand 
crnor 
crnot*™ 
cror 


crore 


crset* 


Arguments 
RT,RA,ME 


RT,RA,ME 
CRF,L,RA,RB 
Crf,RA,RB 
CRF,RA,SI 
CRF,RA,RB 
CRF,RA,UI 


CRF,L,RA,SI 
CRF,L,RA,RB 
CRF,L,RA,UI 
CRF,RA,RB 
CRF,RA,SI 
CRF,RA,RB 
CRF,RA,UI 


RT,RA 
RT,RA 


CRBt,CRBa,CRBb 
CRBt,CRBa,CRBb 


CRBt 


CRBt,CRBa,CRBb 


CRBt,CRBa 


CRBt,CRBa,CRBb 
CRBt,CRBa,CRBb 


CRBt,CRBa 


CRBt,CRBa,CRBb 
CRBt,CRBa,CRBb 


CRBt 
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Description Side Effects 
clear right doubleword [CR,] 
immediate 

clear right word immediate [CR,] 
compare 


compare doubleword 
compare doubleword immediate 
compare logical doubleword 


compare logical doubleword 
immediate 


compare immediate 
compare logical 

compare logical immediate 
compare word 

compare word immediate 
compare logical word 


compare logical word 
immediate 


count leading zeros doubleword [CR,] 
count leading zeros word [CR,] 
condition register and 


condition register and with 
compliment (CRBa* CRB6) 


condition register clear 


condition register equivalent 
(xnor) 


condition register move 
condition register nand 
condition register nor 
condition register not 
condition register or 


condition register or with 


compliment (CRBa@CRBO) 


condition register set 


Mnemonic      Arguments         Description    Side Effects

crxor         CRBt,CRBa,CRBb    condition register xor
dcbf          Ra⁰,RB            data cache block flush
dcbi¹         Ra⁰,RB            data cache block invalidate
dcbst         Ra⁰,RB            data cache block store
dcbt          Ra⁰,RB            data cache block touch
dcbtst        Ra⁰,RB            data cache block touch for store
dcbz          Ra⁰,RB            data cache block set to zero
divd[o][.]    RT,RA,RB          divide doubleword (RA÷RB)    [XER_SO,XER_OV] [CR0]
divdu[o][.]   RT,RA,RB          divide doubleword unsigned (RA÷RB)    [XER_SO,XER_OV] [CR0]
divw[o][.]    RT,RA,RB          divide word (RA÷RB)    [XER_SO,XER_OV] [CR0]
divwu[o][.]   RT,RA,RB          divide word unsigned (RA÷RB)    [XER_SO,XER_OV] [CR0]
eciwx         RT,Ra⁰,RB         external control word in
ecowx         RT,Ra⁰,RB         external control word out
eieio                           enforce in-order execution of I/O
eqv[.]        RT,RA,RB          equivalent (xnor)    [CR0]
extldi*[.]    RT,RA,N,MB        extract and left justify doubleword immediate    [CR0]
extlwi*[.]    RT,RA,N,MB        extract and left justify word immediate    [CR0]
extrdi*[.]    RT,RA,N,MB        extract and right justify doubleword immediate    [CR0]
extrwi*[.]    RT,RA,N,MB        extract and right justify word immediate    [CR0]
extsb[.]      RT,RA             extend sign byte    [CR0]
extsh[.]      RT,RA             extend sign half    [CR0]
extsw[.]      RT,RA             extend sign word    [CR0]
fabs[.]       FRT,FRA           floating absolute value    [CR1]
fadd[.]       FRT,FRA,FRB       floating add double precision    FPSCR [CR1]
fadds[.]      FRT,FRA,FRB       floating add single precision    FPSCR [CR1]
fcfid[.]      FRT,FRA           floating convert from integer doubleword    FPSCR [CR1]
fcmpo         CRF,FRA,FRB       floating compare ordered    FPSCR
fcmpu         CRF,FRA,FRB       floating compare unordered    FPSCR
fctid[.]      FRT,FRA           floating convert to integer doubleword    FPSCR [CR1]
fctidz[.]     FRT,FRA           floating convert to integer doubleword with round toward zero    FPSCR [CR1]
fctiw[.]      FRT,FRA           floating convert to integer word    FPSCR [CR1]
fctiwz[.]     FRT,FRA           floating convert to integer word with round toward zero    FPSCR [CR1]
fdiv[.]       FRT,FRA,FRB       floating divide double precision (FRA÷FRB)    FPSCR [CR1]
fdivs[.]      FRT,FRA,FRB       floating divide single precision (FRA÷FRB)    FPSCR [CR1]
fmadd[.]      FRT,FRA,FRB,FRC   floating multiply add double precision (FRA×FRB + FRC)    FPSCR [CR1]
fmadds[.]     FRT,FRA,FRB,FRC   floating multiply add single precision (FRA×FRB + FRC)    FPSCR [CR1]
fmr[.]        FRT,FRA           floating move register    [CR1]
fmsub[.]      FRT,FRA,FRB,FRC   floating multiply subtract double precision (FRA×FRB - FRC)    FPSCR [CR1]

fmsubs[.]     FRT,FRA,FRB,FRC   floating multiply subtract single precision (FRA×FRB - FRC)    FPSCR [CR1]
fmul[.]       FRT,FRA,FRB       floating multiply double precision    FPSCR [CR1]
fmuls[.]      FRT,FRA,FRB       floating multiply single precision    FPSCR [CR1]
fnabs[.]      FRT,FRA           floating negative absolute value    [CR1]
fneg[.]       FRT,FRA           floating negate    [CR1]
fnmadd[.]     FRT,FRA,FRB,FRC   floating negative multiply add double precision (-[FRA×FRB + FRC])    FPSCR [CR1]
fnmadds[.]    FRT,FRA,FRB,FRC   floating negative multiply add single precision (-[FRA×FRB + FRC])    FPSCR [CR1]
fnmsub[.]     FRT,FRA,FRB,FRC   floating negative multiply subtract double precision (-[FRA×FRB - FRC])    FPSCR [CR1]
fnmsubs[.]    FRT,FRA,FRB,FRC   floating negative multiply subtract single precision (-[FRA×FRB - FRC])    FPSCR [CR1]
fres[.]       FRT,FRA           floating reciprocal estimate single precision    FPSCR [CR1]
frsp[.]       FRT,FRA           floating round to single precision    FPSCR [CR1]
frsqrte[.]    FRT,FRA           floating estimate reciprocal square root double precision    FPSCR [CR1]
fsel[.]       FRT,FRA,FRB,FRC   floating select    [CR1]




fsqrt[.]      FRT,FRA           floating square root double precision    FPSCR [CR1]
fsqrts[.]     FRT,FRA           floating square root single precision    FPSCR [CR1]
fsub[.]       FRT,FRA,FRB       floating subtract double precision    FPSCR [CR1]
fsubs[.]      FRT,FRA,FRB       floating subtract single precision    FPSCR [CR1]
icbi          Ra⁰,RB            instruction cache block invalidate
inslwi[.]*    RT,RA,N,MB        insert from left word immediate    [CR0]
insrdi[.]*    RT,RA,N,MB        insert from right doubleword immediate    [CR0]
insrwi[.]*    RT,RA,N,MB        insert from right word immediate    [CR0]
isync                           instruction synchronize
la*           RT,SI(Ra⁰)        load address
lbz           RT,SI(Ra⁰)        load byte and zero
lbzu          RT,SI(RA)         load byte and zero with update    RA
lbzux         RT,RA,RB          load byte and zero with update indexed    RA
lbzx          RT,Ra⁰,RB         load byte and zero indexed
ld            RT,SIʷ(Ra⁰)       load doubleword
ldarx         RT,RA,RB          load doubleword and reserve    RSRV
ldu           RT,SIʷ(RA)        load doubleword with update    RA
ldux          RT,RA,RB          load doubleword with update indexed    RA
ldx           RT,Ra⁰,RB         load doubleword indexed
lfd           FRT,SI(Ra⁰)       load floating double precision

lfdu          FRT,SI(RA)        load floating double precision with update    RA
lfdux         FRT,RA,RB         load floating double precision with update indexed    RA
lfdx          FRT,Ra⁰,RB        load floating double precision indexed
lfs           FRT,SI(Ra⁰)       load floating single precision
lfsu          FRT,SI(RA)        load floating single precision with update    RA
lfsux         FRT,RA,RB         load floating single precision with update indexed    RA
lfsx          FRT,Ra⁰,RB        load floating single precision indexed
lha           RT,SI(Ra⁰)        load halfword algebraic
lhau          RT,SI(RA)         load halfword algebraic with update    RA
lhaux         RT,RA,RB          load halfword algebraic with update indexed    RA
lhax          RT,Ra⁰,RB         load halfword algebraic indexed
lhbrx         RT,Ra⁰,RB         load halfword byte-reversed indexed
lhz           RT,SI(Ra⁰)        load halfword and zero
lhzu          RT,SI(RA)         load halfword and zero with update    RA
lhzux         RT,RA,RB          load halfword and zero with update indexed    RA
lhzx          RT,Ra⁰,RB         load halfword and zero indexed
li*           RT,SI             load immediate
lis*          RT,SIˢ            load immediate shifted

lmw           RT,SI(Ra⁰)        load multiple word
lswi          RT,RA,N           load string word immediate
lswx          RT,Ra⁰,RB         load string word indexed
lwa           RT,SI(Ra⁰)        load word algebraic
lwarx         RT,Ra⁰,RB         load word and reserve    RSRV
lwau          RT,SI(RA)         load word algebraic with update    RA
lwaux         RT,RA,RB          load word algebraic with update indexed    RA
lwax          RT,Ra⁰,RB         load word algebraic indexed
lwbrx         RT,Ra⁰,RB         load word byte reversed indexed
lwz           RT,SI(Ra⁰)        load word and zero
lwzu          RT,SI(RA)         load word and zero with update    RA
lwzux         RT,RA,RB          load word and zero with update indexed    RA
lwzx          RT,Ra⁰,RB         load word and zero indexed
mcrf          CRFt,CRFa         move condition register field
mcrfs         CRFt,FBFa         move to condition register field from FPSCR    FPSCR
mcrxr         CRF               move XER to condition register field
mfcr          RT                move from condition register (CR)
mfctr*        RT                move from count register (CTR)
mffs[.]       FRT               move from FPSCR    [CR1]
mflr*         RT                move from link register (LR)
mfmsr¹        RT                move from machine state register (MSR)

mfspr²        RT,SPR            move from special purpose register
mfsr¹         RT,Sa             move from segment register
mfsrin¹       RT,RA             move from segment register indirect
mftb          RT,TBR            move from time base register (TBR)
mftb*         RT                move from time base
mftbu*        RT                move from time base upper
mfxer*        RT                move from fixed point exception register (XER)
mr[.]*        RT,RA             move register    [CR0]
mtcr*         RS                move to condition register (CR)
mtcrf         FXM,RS            move to condition register fields
mtctr*        RS                move to count register (CTR)
mtfsb0[.]     FBt               move to FPSCR bit 0 (clear FPSCR bit)    [CR1]
mtfsb1[.]     FBt               move to FPSCR bit 1 (set FPSCR bit)    [CR1]
mtfsf[.]      FXM,FRS           move to FPSCR fields    [CR1]
mtfsfi[.]     FBFt,UI           move to FPSCR field immediate    [CR1]
mtlr*         RS                move to link register (LR)
mtmsr¹        RS                move to machine state register (MSR)
mtspr²        RS,SPR            move to special purpose register
mtsr¹         St,RS             move to segment register
mtsrin¹       RS,RA             move to segment register indirect
mtxer*        RS                move to fixed point exception register (XER)
mulhd[.]      RT,RA,RB          multiply high doubleword    [CR0]
mulhdu[.]     RT,RA,RB          multiply high doubleword unsigned    [CR0]
mulhw[.]      RT,RA,RB          multiply high word    [CR0]
mulhwu[.]     RT,RA,RB          multiply high word unsigned    [CR0]
mulld[o][.]   RT,RA,RB          multiply low doubleword    [XER_SO,XER_OV] [CR0]
mulli         RT,RA,SI          multiply low immediate
mullw[o][.]   RT,RA,RB          multiply low word    [XER_SO,XER_OV] [CR0]
nand[.]       RT,RA,RB          nand    [CR0]
neg[o][.]     RT,RA             negate    [XER_SO,XER_OV] [CR0]
nop*                            no operation
nor[.]        RT,RA,RB          nor    [CR0]
not[.]*       RT,RA             not    [CR0]
or[.]         RT,RA,RB          or    [CR0]
orc[.]        RT,RA,RB          or with complement (RA | ¬RB)    [CR0]
ori           RT,RA,UI          or immediate
oris          RT,RA,UI          or immediate shifted
rfi                             return from interrupt    MSR
rldcl[.]      RT,RS,Rn,MB       rotate left doubleword then clear left    [CR0]
rldcr[.]      RT,RS,Rn,ME       rotate left doubleword then clear right    [CR0]
rldic[.]      RT,RS,N,MB        rotate left doubleword immediate then clear    [CR0]
rldicl[.]     RT,RS,N,MB        rotate left doubleword immediate then clear left    [CR0]
rldicr[.]     RT,RS,N,ME        rotate left doubleword immediate then clear right    [CR0]

rldimi[.]     RT,RS,N,MB        rotate left doubleword immediate then insert mask    [CR0]
rlwimi[.]     RT,RS,N,MB,ME     rotate left word immediate then insert mask    [CR0]
rlwinm[.]     RT,RS,N,MB,ME     rotate left word immediate then and with mask    [CR0]
rlwnm[.]      RT,RS,Rn,MB,ME    rotate left word then and with mask    [CR0]
rotld[.]*     RT,RS,Rn          rotate left doubleword    [CR0]
rotldi[.]*    RT,RS,N           rotate left doubleword immediate    [CR0]
rotlw[.]*     RT,RS,Rn          rotate left word    [CR0]
rotlwi[.]*    RT,RS,N           rotate left word immediate    [CR0]
rotrdi[.]*    RT,RS,N           rotate right doubleword immediate    [CR0]
rotrwi[.]*    RT,RS,N           rotate right word immediate    [CR0]
sc                              system call    .
slbia¹                          SLB invalidate all
slbie¹        RA                SLB invalidate entry
sld[.]        RT,RS,Rn          shift left doubleword    [CR0]
sldi[.]*      RT,RS,N           shift left doubleword immediate    [CR0]
slw[.]        RT,RS,Rn          shift left word    [CR0]
slwi[.]*      RT,RS,N           shift left word immediate    [CR0]
srad[.]       RT,RS,Rn          shift right algebraic doubleword    [CR0]
sradi[.]      RT,RS,N           shift right algebraic doubleword immediate    [CR0]
sraw[.]       RT,RS,Rn          shift right algebraic word    [CR0]
srawi[.]      RT,RS,N           shift right algebraic word immediate    [CR0]
srd[.]        RT,RS,Rn          shift right doubleword    [CR0]
srdi[.]       RT,RS,N           shift right doubleword immediate    [CR0]





srw[.]        RT,RS,Rn          shift right word    [CR0]
srwi[.]       RT,RS,N           shift right word immediate    [CR0]
stb           RS,SI(Ra⁰)        store byte
stbu          RS,SI(RA)         store byte with update    RA
stbux         RS,RA,RB          store byte with update indexed    RA
stbx          RS,Ra⁰,RB         store byte indexed
std           RS,SIʷ(Ra⁰)       store doubleword
stdcx.        RS,Ra⁰,RB         store doubleword conditional    [RSRV] [CR0]
stdu          RS,SIʷ(RA)        store doubleword with update    RA
stdux         RS,RA,RB          store doubleword with update indexed    RA
stdx          RS,Ra⁰,RB         store doubleword indexed
stfd          FRS,SI(Ra⁰)       store floating double precision
stfdu         FRS,SI(RA)        store floating double precision with update    RA
stfdux        FRS,RA,RB         store floating double precision with update indexed    RA
stfdx         FRS,Ra⁰,RB        store floating double precision indexed
stfiwx        FRS,Ra⁰,RB        store floating as integer word
stfs          FRS,SI(Ra⁰)       store floating single precision
stfsu         FRS,SI(RA)        store floating single precision with update    RA
stfsux        FRS,RA,RB         store floating single precision with update indexed    RA
stfsx         FRS,Ra⁰,RB        store floating single precision indexed

sth           RS,SI(Ra⁰)        store halfword
sthbrx        RS,Ra⁰,RB         store halfword byte-reversed
sthu          RS,SI(RA)         store halfword with update    RA
sthux         RS,RA,RB          store halfword with update indexed    RA
sthx          RS,Ra⁰,RB         store halfword indexed
stmw          RS,SI(Ra⁰)        store multiple word
stswi         RS,Ra⁰,N          store string word immediate
stswx         RS,Ra⁰,RB         store string word
stw           RS,SI(Ra⁰)        store word
stwbrx        RS,Ra⁰,RB         store word byte-reversed
stwu          RS,SI(RA)         store word with update    RA
stwux         RS,RA,RB          store word with update indexed    RA
stwx          RS,Ra⁰,RB         store word indexed
stwcx.        RS,Ra⁰,RB         store word conditional    [RSRV] [CR0]
sub[o][.]*    RT,RA,RB          subtract (RA + ¬RB + 1)    [XER_SO,XER_OV] [CR0]
subc[o][.]*   RT,RA,RB          subtract carrying    XER_CA [XER_SO,XER_OV] [CR0]
subf[o][.]    RT,RA,RB          subtract from (¬RA + RB + 1)    [XER_SO,XER_OV] [CR0]
subfc[o][.]   RT,RA,RB          subtract from carrying (¬RA + RB + 1)    XER_CA [XER_SO,XER_OV] [CR0]
subfe[o][.]   RT,RA,RB          subtract from extended (¬RA + RB + CA)    XER_CA [XER_SO,XER_OV] [CR0]
subfic        RT,RA,SI          subtract from immediate carrying (¬RA + IMM + 1)    XER_CA
subfme[o][.]  RT,RA             subtract from minus one extended (¬RA + XER_CA - 1)    XER_CA [XER_SO,XER_OV] [CR0]
subfze[o][.]  RT,RA             subtract from zero extended (¬RA + XER_CA)    XER_CA [XER_SO,XER_OV] [CR0]
subi*         RT,RA,SI          subtract immediate (RA + ¬IMM + 1)
subis*        RT,RA,SIˢ         subtract immediate shifted (RA + ¬IMM + 1)
subic[.]*     RT,RA,SI          subtract immediate carrying (RA + ¬IMM + 1)    XER_CA [CR0]
sync                            synchronize
td            TO,RA,RB          trap doubleword
tdi           TO,RA,SI          trap doubleword immediate
tlbia¹                          TLB invalidate all
tlbie¹        RA                TLB invalidate entry
tlbsync¹                        TLB synchronize
tw            TO,RA,RB          trap word
twi           TO,RA,SI          trap word immediate
xor[.]        RT,RA,RB          xor    [CR0]
xori          RT,RA,UI          xor immediate
xoris         RT,RA,UI          xor immediate shifted


Table F.2. Key to instruction argument symbols.

Symbol  Description
BI      Conditional branch BI field: selects the condition register bit to test.
BO      Conditional branch BO field: selects options for instruction operation.
CRB     Condition register bit.
CRF     Condition register field.
FRn     FPR n.
FB      FPSCR bit.
FBF     FPSCR field.
FXM     Register field mask, one bit per field.
L       Compare length bit (set to zero for 32-bit compares, one for 64-bit compares).
MB      Mask begin position.
ME      Mask end position.
N       Count.
Rn      GPR n.
Sn      Segment Register n.
SI      Immediate field which will be sign extended.
TO      Trap TO field: selects conditions for trap.
UI      Immediate field which will be zero extended.
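
As a rough illustration of how the SI, UI, MB, and ME fields are interpreted, the following C sketch models sign extension, zero extension, and the mask applied by the 32-bit rotate-and-mask instructions. The function names are invented for this example, a 32-bit register width is assumed, and bits are numbered in the usual PowerPC way, with bit 0 as the most significant bit.

```c
#include <stdint.h>

/* SI: a 16-bit immediate field that is sign extended to 32 bits. */
static uint32_t extend_si(uint16_t field)
{
    return (field & 0x8000u) ? (0xFFFF0000u | field) : (uint32_t)field;
}

/* UI: a 16-bit immediate field that is zero extended to 32 bits. */
static uint32_t extend_ui(uint16_t field)
{
    return (uint32_t)field;
}

/* Mask of ones from bit mb through bit me (bit 0 is the MSB).
 * When mb > me the mask wraps around. Assumes mb and me are 0..31. */
static uint32_t mask32(unsigned mb, unsigned me)
{
    uint32_t from_mb = 0xFFFFFFFFu >> mb;          /* ones in bits mb..31 */
    uint32_t to_me   = 0xFFFFFFFFu << (31u - me);  /* ones in bits 0..me  */
    return (mb <= me) ? (from_mb & to_me) : (from_mb | to_me);
}

/* Rotate a 32-bit value left by n bit positions (n taken modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}

/* Model of rlwinm RT,RS,N,MB,ME: rotate left, then AND with the mask. */
static uint32_t rlwinm(uint32_t rs, unsigned n, unsigned mb, unsigned me)
{
    return rotl32(rs, n) & mask32(mb, me);
}
```

For example, rlwinm(x, 8, 24, 31) moves the most significant byte of x into the low byte of the result, which is the effect the extended mnemonic extrwi provides for an 8-bit field starting at bit 0.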


Table F.3. Key to instruction notes.

Symbol  Description
mnem¹   Instruction is privileged.
mnem²   Instruction is privileged for certain arguments.
mnem*   Instruction is an extended mnemonic for another instruction.
Ra⁰     When set to 0, this argument is ignored rather than being the contents of GPR 0.
SIˢ     The immediate value is shifted left 16 bits.
SIʷ     The immediate value must be word aligned.
*       Operating system dependent.
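
The note on the Ra argument and the note on shifted immediates can be made concrete with a small C model. This is only a sketch: the Regs structure and the function names are hypothetical, and a 32-bit implementation is assumed.

```c
#include <stdint.h>

/* Hypothetical register file for illustration: 32 GPRs, 32 bits wide. */
typedef struct {
    uint32_t gpr[32];
} Regs;

/* Effective address for a load or store written as SI(Ra).
 * If the Ra field is 0, the base is the value 0, not the contents
 * of GPR 0. SI is sign extended before it is added. */
static uint32_t ea_d_form(const Regs *r, unsigned ra, int16_t si)
{
    uint32_t base = (ra == 0) ? 0u : r->gpr[ra];
    return base + (uint32_t)(int32_t)si;
}

/* Value produced by lis RT,SI: the immediate shifted left 16 bits. */
static uint32_t lis_value(int16_t si)
{
    return (uint32_t)(uint16_t)si << 16;
}
```

With GPR 0 holding 0x1000, ea_d_form(&r, 0, 8) still yields 8, because the register number 0 in this position selects the constant zero.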


Table F.4. Simplified branch mnemonics.

                                                                           Target address type
Branch semantics (X)ᵃ                                                      (bc) relative  (bca) absolute  (bclr) to link  (bcctr) to count
branch unconditionally                                                     -              -               blr[l]          bctr[l]
branch if condition true (t)                                               bt[l]          bt[l]a          btlr[l]         btctr[l]
branch if condition false (f)                                              bf[l]          bf[l]a          bflr[l]         bfctr[l]
decrement count and branch if count non-zero (dnz)                         bdnz[l]        bdnz[l]a        bdnzlr[l]       -
decrement count and branch if count zero (dz)                              bdz[l]         bdz[l]a         bdzlr[l]        -
decrement count and branch if count non-zero and condition true (dnzt)     bdnzt[l]       bdnzt[l]a       bdnztlr[l]      -
decrement count and branch if count non-zero and condition false (dnzf)    bdnzf[l]       bdnzf[l]a       bdnzflr[l]      -
decrement count and branch if count zero and condition true (dzt)          bdzt[l]        bdzt[l]a        bdztlr[l]       -
decrement count and branch if count zero and condition false (dzf)         bdzf[l]        bdzf[l]a        bdzflr[l]       -

a. The building block which is used to form the mnemonic is shown in parentheses beside the
semantic description. The mnemonic for the branch is built out of three components: b[direction
mnemonic][target mnemonic].


Table F.5. Simplified branch mnemonics with comparison conditions.

                                                    Target address type
Branch semantics (X)ᵃ                               (bc) relative  (bca) absolute  (bclr) to link  (bcctr) to count
branch if less than (lt)                            blt[l]         blt[l]a         bltlr[l]        bltctr[l]
branch if less than or equal to (le)                ble[l]         ble[l]a         blelr[l]        blectr[l]
branch if equal to (eq)                             beq[l]         beq[l]a         beqlr[l]        beqctr[l]
branch if greater than or equal to (ge)             bge[l]         bge[l]a         bgelr[l]        bgectr[l]
branch if greater than (gt)                         bgt[l]         bgt[l]a         bgtlr[l]        bgtctr[l]
branch if not less than (nl)                        bnl[l]         bnl[l]a         bnllr[l]        bnlctr[l]
branch if not equal to (ne)                         bne[l]         bne[l]a         bnelr[l]        bnectr[l]
branch if not greater than (ng)                     bng[l]         bng[l]a         bnglr[l]        bngctr[l]
branch if summary overflow (so)                     bso[l]         bso[l]a         bsolr[l]        bsoctr[l]
branch if not summary overflow (ns)                 bns[l]         bns[l]a         bnslr[l]        bnsctr[l]
branch if unordered (un)                            bun[l]         bun[l]a         bunlr[l]        bunctr[l]
  (see floating point compare instructions below)
branch if not unordered (nu)                        bnu[l]         bnu[l]a         bnulr[l]        bnuctr[l]
  (see floating point compare instructions below)

a. The building block which is used to form the mnemonic is shown in parentheses beside the
semantic description. The mnemonic for the branch is built out of three components: b[CR
code][target code][l].
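
The way these mnemonics are assembled from their building blocks can be sketched in C. The Target enumeration and the function branch_mnemonic are hypothetical names used only for this illustration. The sketch follows the pattern visible in the table columns: the letter b, the condition code, lr or ctr for the register targets, an optional l for the link form, and a trailing a for the absolute form. It does not reject the combinations that the tables mark as unavailable.

```c
#include <stdio.h>

/* Hypothetical names for the four target address types in the tables. */
typedef enum { TGT_RELATIVE, TGT_ABSOLUTE, TGT_LINK, TGT_COUNT } Target;

/* Build a simplified branch mnemonic into out (capacity n).
 * code is the building block from the table, for example "lt" or "dnz".
 * link is nonzero when the link register should be updated.
 * Returns the value of snprintf: the length of the full mnemonic. */
static int branch_mnemonic(char *out, size_t n, const char *code,
                           Target target, int link)
{
    const char *reg = (target == TGT_LINK)  ? "lr"
                    : (target == TGT_COUNT) ? "ctr"
                    : "";
    return snprintf(out, n, "b%s%s%s%s",
                    code,
                    reg,
                    link ? "l" : "",
                    (target == TGT_ABSOLUTE) ? "a" : "");
}
```

Called with "ne", TGT_LINK, and a nonzero link flag, the function produces bnelrl, the "branch if not equal to link register, and link" form.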








Register Set 


Table F.6. PowerPC register set.

Symbol           Width     Access      Description
CR               32        User        Condition Register
LR               32,64     User        Link Register
CTR              32,64     User        Count Register
GPR 0 - 31       32,64     User        General Purpose Registers
XER              32        User        Fixed Point Exception Register
FPR 0 - 31       64        User        Floating Point Registers
FPSCR            32        User        Floating Point Status and Control Register
MQ               32        User        MQ Register (601 only)
DEC              32        Privileged  Decrementer
SRR 0 - 1        32,64     Privileged  Machine Status Save/Restore Registers
MSR              32,64     Privileged  Machine State Register
DAR              32,64     Privileged  Data Address Register
DSISR            32        Privileged  Data Storage Interrupt Status Register
SPRG 0 - 3       32,64     Privileged  Software-use Special Purpose Registers
PVR              32        Privileged  Processor Version Register
IBAT 0 - 3 U/L   32,64     Privileged  Instruction Block Address Translation Upper/Lower Registers (8 total)
DBAT 0 - 3 U/L   32,64     Privileged  Data Block Address Translation Upper/Lower Registers (8 total)
EAR              32        Privileged  External Access Register (optional)
HID 0 - N        Optional  Optional    Implementation-specific SPRs (optional)
TB               64        User Read   Time Base Register
RTCU             32        User Read   Real Time Clock Upper (601 only)
RTCL             32        User Read   Real Time Clock Lower (601 only)
ASR              64        Privileged  Address Space Register (64-bit implementations only)

Assembler
Assembler Pseudo-Ops


.align n


Forces the alignment of the next assembly element to occur on the next 2^n byte boundary. If
the current csect is of type PR or GL, then any alignment padding will be filled with the nop
instruction.
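
The arithmetic behind a 2^n alignment can be sketched in C as follows. The function names are invented for the example, and the sketch assumes a 32-bit location counter and n less than 32. It models only the address calculation, not the assembler itself.

```c
#include <stdint.h>

/* Round loc up to the next multiple of 2^n (loc is returned unchanged
 * if it is already aligned). Assumes n < 32. */
static uint32_t align_up(uint32_t loc, unsigned n)
{
    uint32_t mask = ((uint32_t)1 << n) - 1u;
    return (loc + mask) & ~mask;
}

/* Number of padding bytes needed to move loc to that boundary. */
static uint32_t align_padding(uint32_t loc, unsigned n)
{
    return align_up(loc, n) - loc;
}
```

For instance, with the location counter at 0x1001, an alignment of n = 2 advances it to 0x1004, which requires three bytes of padding.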

.byte expression[,expression...]

Allocates a byte or region of bytes, initialized to the value of expression.

.comm name, expression[,n]

Defines an uninitialized common block name, of size expression, aligned on a 2^n byte boundary.

.csect [name][[storage_class]][,n]

Specifies the csect that the following code or data belong in. The alignment of the csect may be
set to a 2^n byte boundary.

.double expression

Allocates eight bytes of memory initialized to the double precision floating point value of
expression. Alignment padding will be added so that the value is word aligned.

.drop n

Stops using register n as a base register.

.dsect name

Specifies the dummy csect that the following code or data belong in.

.extern name

Identifies name as a symbol from another module.

.float expression

Allocates 4 bytes of memory initialized to the single precision floating point value of expression.
Alignment padding will be added so that the value is word aligned.

.globl name

Identifies name as a global symbol.

.lcomm name, expression[, section]

Defines a block of uninitialized storage name of size expression in local common section section.

.long expression[,expression...]

Allocates words of memory initialized to expression. Alignment padding will be added so that
the data is word aligned.

.org expression

Sets the value of the current location counter ($).

.rename name, string

Creates external alias string for internal symbol name.

.set name, expression

Defines a symbolic constant name for expression.

.short expression[,expression...]

Allocates halfwords of memory initialized to expression. Alignment padding will be added so
that the data is halfword aligned.

.space n

Allocates n bytes of memory initialized to zero.

.string string

Allocates initialized bytes of memory to hold string and terminating null.

.tc name[TC], expression[,expression...]

Creates a TOC entry.

.toc

Indicates the TOC csect.

.tocof symbol,name

Defines local symbol name as a reference to another module’s TOC which contains the global
symbol symbol.

.using expression,n

Tells the assembler to use register n as a base register and to assume that it contains the value
expression.

.vbyte n, expression

Allocates n bytes (4 maximum) of memory and initializes them to expression.


Table F.7. Assembler storage classes.

Class  Section  Description
PR     .text    Program Code
RO     .text    Read Only Data
GL     .text    Glue Code
TC     .data    TOC Entry
UA     .data    Unknown Type
RW     .data    Read Write Data
DS     .data    Function Descriptor
BS     .bss     Uninitialized Read Write Data
UC     .bss     Unnamed Common


Table F.8. Assembler special symbols.

Symbol    Description
$         Current location counter value.
TOC[TC0]  TOC anchor: the address of the current module’s TOC.








G 


Further Reading
General Reference


Hennessy, John L. and David A. Patterson. Computer Architecture: A Quantitative Approach.
Palo Alto: Morgan Kaufmann, 1990.


This computer architecture textbook is considered by many to be the “bible” of RISC architecture.


Hennessy, John L. and David A. Patterson. Computer Organization and Design. Palo
Alto: Morgan Kaufmann, 1994.


This sequel to their first book, Computer Organization and Design focuses on system design 
and performance issues, as well as the interface of the processor to software and other system 
components. 


Johnson, Mike. Superscalar Microprocessor Design. New York: Prentice Hall, 1991. 


This book takes up where Hennessy and Patterson leave off, and covers the design of Superscalar 
(parallel pipeline) processors. 


Tanenbaum, Andrew S. Operating System Design and Implementation. Englewood Cliffs:
Prentice Hall, 1987.


A very readable text that focuses on the real problems of building operating systems, rather 
than “pretty” theoretical problems that clutter some books on the subject. Includes source code 
for Minix, a UNIX-like PC operating system. 


PowerPC Architecture and Implementations 
Books 


Duntemann, Jeff and Ron Pronk. Inside the PowerPC Revolution. Scottsdale: Coriolis
Group, 1994.


Where most of the other material written about PowerPC is strictly technical in focus, this is 
a book about where the computer marketplace and technology has been, where it’s going, and 
how PowerPC fits in. Full of history, insider information (occasionally, misinformation), and 
technical details, this is more a book about the PowerPC technoculture phenomenon than 
architecture or programming. 


IBM Corporation. The PowerPC Architecture. Palo Alto: Morgan Kaufmann, 1994. 


Books 1, 2, and 3 of the official PowerPC architecture specification are published in this volume
(the partnership developed these specifications at four levels, each described by a separate
“book”). Books 1 through 3 describe the instruction set and programming environment.


IBM Microelectronics and Motorola. PowerPC 601 User’s Guide.
IBM Microelectronics and Motorola. PowerPC 603 User’s Guide.
IBM Microelectronics and Motorola. PowerPC 604 User’s Guide.


These books describe the respective chip implementation in extensive (and sometimes stupefying)
detail. Essential for system developers, these books can also be useful to programmers
needing a reference for a particular chip. All include detailed descriptions of instruction timing,
pipeline interactions, and a wealth of other minutiae (the 601 and 603 versions include an
instruction set reference). These books, and other detailed specifications, are available from the
manufacturers. Call IBM at 1-800-POWERPC or Motorola at 1-800-845-MOTO to order.


Weiss, Shlomo and James E. Smith. POWER and PowerPC. Palo Alto: Morgan Kaufmann, 
1994. 


This book discusses modern computer architecture, studying how principles are applied in the 
design and implementation of the POWER and PowerPC machines. 


Young, Jerry. Insider’s Guide to PowerPC Computing. Indianapolis: Que Publishing, 1994. 


This book is a good introduction to the PowerPC architecture and features of the 601 and 603
processor chips. This book is more readable than the specifications or manufacturers’ User’s
Guides, and is a good introduction to PowerPC chips for system designers and programmers
alike.


Articles 
Byte, 18(8), August 1993. 


This issue has several PowerPC articles including details of the PowerPC 601 processor, oper- 
ating system support, and an introduction to RISC. 


Communications of the ACM, 37(6), June 1994.


This issue of CACM has a collection of articles on PowerPC, including the history of the
architecture and the partnership, the 603 processor, and the compilers and simulators available from
Motorola.


IEEE Micro, 14(5), October 1994. 


This issue of IEEE Micro features several PowerPC articles, including the PowerPC instruction
set, some details of the 601 and 604 implementations, and the PowerPC 60X bus design.


“PowerPC 620 Soars.” Byte 19(11), November 1994: 113-120.


This article describes the features of the 64-bit PowerPC 620.


Optimization and Performance


Books 


Thompson, Tom. Power Macintosh Programming Starter Kit. Indianapolis: Hayden 
Books, 1994. 


A detailed treatment of high-level language development for the Power Macintosh, including 
a demonstration version of the Metrowerks CodeWarrior development toolset on CD-ROM. 


Articles 
Gillig, James R. “Endian-Neutral Software, Part 1.” Dr. Dobb’s Journal 220, October 1994.


Gillig, James R. “Endian-Neutral Software, Part 2.” Dr. Dobb’s Journal 222, November
1994.


This two-part series discusses strategies for avoiding program incompatibility due to different
memory byte ordering (endianness) conventions in different systems, and the
bi-endian features of the PowerPC architecture.


Heisch, R.R. “Trace-directed program restructuring for AIX executables.” IBM Journal
of Research and Development 38(5), September 1994: 595-604.


This article describes the methods used in the FDCR (feedback directed code restructuring)
tool for optimizing cache performance mentioned in Chapter 7, “Performance Tuning and
Optimization.”


Thompson, Tom. “Power Mac Code Optimizations.” Byte 19(11), November 1994: 
291-292. 


A short article about improving performance of high-level language Power Macintosh programs,
particularly considerations in the OS and library interfaces.


Symbols 


+ (plus sign) 
branch instruction 
mnemonics, 141 
extended mnemonics, 77 
- (minus sign), extended 
mnemonics, 77 
. (dot) 
instruction mnemonics, 
30, 92 
record option, 138 
32-bit 
floating-point conversion 
instructions, 98-99 
integers 
compare instructions, 
78-79 
division instructions, 38 
load instructions, 60 
multiplication instruc- 
tions, 35 
special load/store
instructions, 65
store instructions, 62 
logical operation instruc- 
tions, 42-43 
page translation mecha- 
nism (memory), 310-313 
rotate instructions, 51-52 
shift instructions, 47 
single-precision floating- 
point numbers, 90 
trap instructions, 85 
virtual memory manage- 
ment, 314-319 
601 processors, 5 
603 processors, 5 
620 processors, 6 
64-bit 
double-precision floating- 
point numbers, 90 
floating-point conversion 
instructions, 99-100 
integers 
arithmetic symbols, 
171-181 
compare instructions, 80 





division instructions, 39 
load instructions, 61 
multiplication instruc- 
tions, 37 
store instructions, 63 
logical operation instruc- 
tions, 44 
rotate instructions, 54-56 
shift instructions, 48 
trap instructions, 85 
virtual memory architec- 
ture, 319-320 


A 


absolute branch instruc- 
tions, 70 

accessing instructions from 
memory, 22 

add carrying (addcx) 
instruction, 149 

add extended (adde[o][.]) 
instruction, 31 

add immediate (addi) 
instruction, 31 

add immediate carrying with 
and without condition 
register record (addic[.]) 
instruction, 31 

add immediate shifted 
(addis) instruction, 31 

add instructions (integers), 
30-32 

add to minus one extended 
(addme[o][.]) instruc- 
tion, 31 

add to zero extended 
(addze[o][.]) instruc- 
tion, 31 

add[o][.] (basic add)
instruction, 31

add[o][.] (integer add)
instruction, 191

addc[o][.] (integer add 
carrying) instruction, 191 

addcx (add carrying) 


instruction, 149 





adde[o][.] (add extended)
instruction, 31
adde[o][.] (integer add
extended) instruction, 192
addi (add immediate)
instruction, 31
addi (integer add immediate)
instruction, 192
addic[.] (integer add
immediate carrying)
instruction, 31, 192
adding floating-point
register contents, 93-94
addis (add immediate
shifted) instruction, 31
addis (integer add immediate
shifted) instruction, 193
addme[o][.] (integer add
extended to minus one)
instruction, 31, 193
addresses 
branch instructions, 70 
effective, 310 
formats via instruc- 
tions, 142 
functions (registers), 106 
generating, 57 
immediate, 18 
incrementing, 145 
indirect, 18 
instructions (program 
counters), 70 
interrupts, 23 
memory, 17-18 
registers, 17 
subroutines, 23 
addze[o][.] (integer add
extended to zero) instruc- 
tion, 31, 194 
algorithms, optimizing, 131 
alignment 
memory 
caches, 154 
references, 143 
stacks, 110 
Amdahl’s law (architec- 
ture), 10 


AND (Boolean operation) 
instruction, 40 
and[.] (logical AND) 
instruction, 194 
andc[.] (logical AND with
complement) instruction,
194
andi (logical AND immedi- 
ate) instruction, 195 
andis (logical AND shifted 
immediate) instruc- 
tion, 195 
applications, optimizing 
programming, 130 
architecture, 6-7 
64-bit virtual memory, 
319-320 
Amdahl’s law, 10 
branch processor, 22-23 
byte ordering conventions, 
354-355 
CISC, 5, 11-12 
floating-point, 20-22, 90, 
337-339 
implementations, 134 
instruction set, 4-5, 358 
integer processor, 19 
interrupts, 23, 330-331 
load/store instruction
set, 11
performance, 10-11 
POWER instruction set, 
4, 143 
PowerPC, 13-23, 135-136 
PReP standard, 7 
register set, 374-375 
RISC, 4, 141-142 
text references (re- 
sources), 380 
arguments (stack frames), 
passing, 111 
arithmetic 
instructions, 30-39, 92-96 
symbols (64-bit integers), 
171-181 
arrays (integers), storing in 
memory, 73 


assembly language 
pseudo operations, 376 
programming development 
tools, 130 
routines 
recoding, 131 
string copy, 144 
storage classes, 378 


symbols, 378 
atomic memory operations, 
see reservations 


b[l] (unconditional branch)
instruction, 195
b[l][a] (unconditional long
branch) instruction, 71
basic add (add[o][.])
instruction, 31
BAT (block address transla- 
tion) mechanism, memory, 
313-314 
bc[l][a] (branch conditional)
instruction, 71, 196
bcctr[l] (conditional branch
to count register) instruction,
72, 80, 199
bclr[l] (conditional branch
to link register) instruction,
72, 200
big endian, see endian 
bit codes (condition 
register), 76 
Boolean operations, 82 
extended mnemonics, 
76-77 
manipulating, 82 
setting, 78-80 
bits 
FPSCR, 101 
MSR, 304-306 
BO field encodings 
branch instructions, 71-72, 
196-197 
extended mnemonics,
74-75







Boolean operations 
condition register bit 
codes, 82 
instructions, 40-44 
symbols, 24-25 
Boyer-Moore string 
searches, 183-187 
branch and decrement 
branches instruction, 
70, 73 
branch and link (reg- 
isters), 106 
branch conditional instruction
(bc[l][a]), 71
branch conditional to count
register instruction
(bcctr[l]), 72
branch conditional to link
register instruction
(bclr[l]), 72
branch instructions 
BO field encodings, 71-72, 
196-197 
branch to link register, 70 
coding in special-purpose 
registers, 15 
combining with compare 
instructions, 78 
condition register, 136-139 
conditional, 71 
count register decrement/ 
test capabilities, 146 
extended mnemonics, 74, 
76-77 
fields, 371-372 
mnemonics, 372-373 
moving between count/ 
link registers, 80-82 
PowerPC architecture, 
135-136 
prediction schemes, 136 
dynamic, 141 
static, 140-141 
special-purpose registers, 
80-82 
unconditional, 71 
branch processor architec- 
ture, 22-23 











branch to count instruc- 
tions, 70 
branch to link instructions, 
see return branch instruc- 
tions 
bytecopy function, 148 
bytes 
endian memory storage, 
58-59 
null (C strings), 147 
ordering conventions, 
354-355 
reversal instructions, 63 
signed, 59 


C 


C programming language 
strings, 147 
symbols, 163-168 
caches (memory), 150 
alignment, 154 
boundaries, 154 
capacity misses, 153 
flushing into memory, 210 
invalidating, 211, 233 
optimizing, 154 
properties, 312 
setting to zero, 213 
storing into memory, 212 
TLBs, 319 
calls 
functions, 106 
subroutines, 70 
carry bits (registers), 
148-149 
CISC (complex instruction
set computer) architecture,
5, 11-12, 378
closing loops, 73 
clrldi[.] (clear left
doubleword immediate)
instruction, 200
clrlsldi[.] (clear left and shift 
left doubleword immedi- 
ate) instruction, 200 
clrlslwi (clear left and shift
left word immediate)
instruction, 50, 201







clrlwi[.] (clear left word
immediate) instruction,
50, 201

clrrdi[.] (clear right 
doubleword immediate) 
instruction, 202 

clrrwi[.] (clear right word 
immediate) instruction, 
50, 202 

cmp (integer compare) 
instruction, 202 

cmpd (compare 
doubleword) instruc- 
tion, 79 

cmpd (compare 
doubleword) instruc- 
tion, 203 

cmpdi (compare doubleword 
immediate) instruction, 
79, 203 

cmpi (integer compare 
immediate) instruc- 
tion, 204 

cmpl (compare logical) 
instruction, 204 

cmpld (compare logical 
doubleword) instruction, 
79, 203 

cmpldi (compare logical 
doubleword immediate) 
instruction, 79, 204 

cmpli (compare logical 
immediate) instruc- 
tion, 205 

cmplw (compare logical 
word) instruction, 78 

cmplw (compare logical 
word) instruction, 206 

cmplwi (compare logical 
word immediate) instruc- 
tion, 78, 206 

cmpw (compare word) 
instruction, 78, 205 

cmpwi (compare word 
immediate) instruction, 
78, 205 

cntlzd[.] (count leading
zeros doubleword)
instruction, 206


cntlzd[.] (count leading 
zeros doubleword) 
instruction, 42 
cntlzw[.] (count leading
zeros word) instruction,
42, 206
code scheduling, 12 
coding branch instructions 
in special-purpose regis- 
ters, 15 
combining 
branch and compare 
instructions, 78 
floating-point multiply/ 
divide instructions, 95 
commands 
instruction set architecture, 
4-5 
UNIX, 133 
see also instructions 
compare and swap func- 
tion, 329 
compare instructions, 
78-80, 137 
comparing condition 
register operands, 78 
complex instruction set
computer, see CISC
compulsory misses (memory 
caches), 153 
condition register, 14 
bit codes, 76 
Boolean operations, 82 
extended mnemonics, 
76-77 
manipulating, 82 
setting, 78-80 
branch instructions, 
136-139 
comparison instruc- 
tions, 137 
copying XER high-order 
bits, 81 
data entry, 78-80 
field encodings, 78 
field instruction (mcrf), 83


fields 
moving, 101 
operand, 75 
setting, 96 
logical instructions, 
82-83, 137 
operands, 78 
updating, 73 
conditional branch instruc- 
tions, 135 
conditional store word 
(stwcx.) instruction, 291 
conflict misses (memory 
caches), 154 
context synchronizing 
instructions, 322 
control flow instructions, 
70, 135 
convert floating-point to 
integer doubleword 
(fctid[.]) instruction, 222 
convert floating-point to 
integer doubleword with 
round toward zero (fctidz) 
instruction, 222 
convert floating-point to 
integer word (fctiw[.]) 
instruction, 223 
convert floating-point to
integer word with round
toward zero (fctiwz[.])
instruction, 223
convert integer doubleword
to floating-point (fcfid[.])
instruction, 221
convert[.] instruction, 346
converting floating-point 
numbers to/from 
integers, 98 
copying XER high-order bits 
to condition register, 81 
count leading zero instruc- 
tions (integers), 42 


count register, 15, 23 
iteration variables, 73 
loops 

controlling, 146 
iterations, 138-139 
values, moving to/from 
link register, 80-82 
CR, see condition register 
crand (condition register 
AND) instruction, 207 
crandc (condition register
AND with complement)
instruction, 207
crclr (condition register 
clear) instruction, 207 
creqv (condition register 
equivalent) instruc- 
tion, 208 
crmove (condition register 
move) instruction, 208 
crnand (condition register 
NAND) instruction, 208 
crnor (condition register 
NOR) instruction, 208 
crnot (condition register 
not) instruction, 209 
cror (condition register OR) 
instruction, 209 
crorc (condition register OR
with complement) instruction, 209
crset (condition register set) 
instruction, 210 
crxor (condition register 
XOR) instruction, 210 
CTR, see count register 


D 


data cache block touch 
(dcbt) instruction, 155 

data cache block touch for 
store (dcbtst) instruc- 
tion, 155 

data entry (condition 
register), 78-80 







data sizes (integer proces- 
sors), 59 

dcbf (data cache block flush) 
instruction, 210 

dcbi (data cache block 
invalidate) instruction, 211 

dcbst (data cache block 
store) instruction, 212 

dcbt (data cache block 
touch) instruction, 212 

dcbt (data cache block 
touch) instruction, 155 

dcbtst (data cache block 
touch for store) instruc- 
tion, 155, 213 

dcbz (data cache block zero) 
instruction, 213 

debuggers, 130-132 

decrement instructions, 144 

development tools, program- 
ming, 130 

devices (PowerPC imple- 
mentation), 353 

direct store segments, 320 

divide instructions (inte- 
gers), 37-39, 214-215 

dividing floating-point 
register contents, 94-95 

dot (.) 

instruction mnemonics, 
30, 92 
record option, 138 

doubleword (data sizes), 59 

DRAM (dynamic random 
access memory), 12 

dynamic prediction schemes 
(branch instructions), 141 


E 


eciwx (external control word 
in) instruction, 215 

ecowx (external control 
word out) instruction, 216 

effective addresses, 310 

eieio (enforce in-order 
execution of I/O) instruc- 
tion, 216 













endian 
ordering conventions 
(bytes), 354-355 
storage (bytes), 58-59 
epilog functions (stacks), 
112-115 
eqv[.] (logical equivalent) 
instruction, 216 
exceptions (floating-point), 
338-339 
execution model (floating- 
point architecture), 20 
exponents (floating-point 
numbers), 20 
extended mnemonics 
64-bit rotation instruc- 
tions, 55-56 
BO field encodings, 74-75, 
197-198 
condition register bit 
encodings, 76-77, 
198-199 
moving between count/ 
link registers, 80 
TO field encodings, 
86-87, 297 
trap instructions, 87, 298 
extldi[.] (extract and left 
justify doubleword 
immediate) instruc- 
tion, 217 
extlwi[.] (extract and left 
justify word immediate) 
instruction, 49, 217 
extract and right justify
word immediate (extrwi[.])
instruction, 49
extrdi[.] (extract and right 
justify doubleword 
immediate) instruc- 
tion, 218 
extrwi[.] (extract and right 
justify word immediate) 
instruction, 49, 218 
extsb[.] (sign extend byte) 
instruction, 42, 218 


extsh[.] (sign extend 
halfword) instruction, 
42, 219 

extsw[.] (sign extend word)
instruction, 42, 219


F 


fabs[.] (floating-point 
absolute value) instruction, 
92, 220 

fadd[.] (floating-point add
double) instruction,
93, 220

fadds[.] (floating-point add
single) instruction, 93, 220

fcfid[.] (convert integer 
doubleword to floating- 
point) instruction, 99, 
221, 349 

fcmpo (floating-point
compare ordered) instruction, 221

fcmpu (floating-point
compare unordered)
instruction, 221

fctid[.] (floating-point 
convert to integer 
doubleword) instruction, 
99, 222 

fctidz (convert floating- 
point to integer 
doubleword with 
round toward zero) 
instruction, 99, 222 

fctiw[.] (floating-point
convert to integer word)
instruction, 98, 223

fctiwz[.] (floating-point
convert to integer word
with round to zero)
instruction, 98, 223

fdiv[.] (floating-point divide 
double) instruction, 
94, 223 

fdivs[.] (floating-point 
divide single) instruction, 
94, 224 

fetch and add function, 328 





fetch and nop function, 327 
fetch and store function, 328 
field encodings 
BO 
branch instructions, 
71-72, 196-197 
extended mnemonics, 
74-75 
condition register, 78 
SPR, 253-255 
TBR, 257 
TO, 86-87, 297 
field operand (condition 
registers), 75 
fields 
branch instructions, 
371-372 
condition register, 101 
FPSCR, 101 
FIFO (first in-first out) 
replacement memory 
caches, 152 
fixed point exception 
register (XER), 14, 149 
floating-point instructions, 
149-150 
absolute value 
(fabs[.]), 220 
add/subtract, 93, 220 
architecture, 90, 337-339 
compare, 96-97, 221 
conversion, 98-100, 
346-349 
divide double precision 
(fdiv[.]), 223 
divide single precision 
(fdivs[.]), 224 
estimate reciprocal square 
root double precision 
(frsqrte[.]), 230 
exceptions, 338-339 
load, 90-91 
move, 92 
move register (fmr[.]), 225 
multiply accumulate, 
95-96 
multiply accumulate
double precision
(fmadd[.]), 224
multiply accumulate single
precision (fmadds[.]),
224
multiply double precision 
(fmul[.]), 226 
multiply single precision 
(fmuls[.]), 226 
multiply subtract double 
precision (fmsub[.]), 225 
multiply subtract single 
precision (fmsubs[.]), 226 
multiply/divide, 94-95 
negate (fneg[.]), 227 
negate multiply accumulate
double precision
(fnmadd[.]), 227
negate multiply accumulate
single precision
(fnmadds[.]), 228
negate multiply subtract
double precision
(fnmsub[.]), 228
negate multiply subtract
single precision
(fnmsubs[.]), 229
negative absolute value 
(fnabs[.]), 227 
reciprocal estimate single 
precision (fres[.]), 229 
round to single precision 
(frsp[.]), 230, 342 
rounding, 97-98 
select (fsel[.]), 231 
square root double 
precision (fsqrt[.]), 231 
square root single precision 
(fsqrts[.]), 232 
store, 91 
subtract double precision 
(fsub[.]), 232 
subtract single precision 
(fsubs[.]), 232 
floating-point numbers 
components, 20 
converting to/from 
integers, 98 
rounding, 21-22, 97 
floating-point operations, 
339-350 


floating-point processor 
architecture, 20-22 
floating-point registers, 
14, 22 
arithmetic operations 
add/subtract, 93-94 
multiply/divide, 94-95 
contents, moving, 92 
floating-point status and 
control register (FPSCR), 
15, 100 
flushing cache blocks into 
memory, 210 
fmadd[.] (floating-point
multiply accumulate double
precision) instruction,
95, 224
fmadds[.] (floating-point
multiply accumulate single
precision) instruction,
95, 224
fmr[.] (floating-point move 
register) instruction, 
92, 225 
fmsub[.] (floating-point 
multiply subtract double 
precision) instruction, 
95, 225 
fmsubs[.] (floating-point 
multiply subtract single 
precision) instruction, 
95, 226 
fmul[.] (floating-point
multiply double precision)
instruction, 94, 226
fmuls[.] (floating-point 
multiply single precision) 
instruction, 94, 226 
fnabs[.] (floating-point
negative absolute value)
instruction, 92, 227
fneg[.] (floating-point 
negate) instruction, 
92, 227 
fnmadd[.] (floating-point
negate multiply accumulate
double precision) instruction,
95, 227




fnmadds[.] (floating-point
negate multiply accumulate
single precision) instruction,
95, 228
fnmsub[.] (floating-point 
negate multiply subtract 
double precision) instruc- 
tion, 95, 228 
fnmsubs[.] (floating-point 
negate multiply subtract 
single precision) instruc- 
tion, 95, 229 
formats (addresses) via 
instructions, 142 
FPR (Floating-Point 
Registers), 108 
FPSCR (floating-point 
status and control register), 
15, 100-102, 337-339 
frames (stacks) 
creating, 114-115 
destroying, 114-115 
link area, 110-111 
local, 111-112 
passing arguments, 111 
register save areas, 112 
fres[.] (floating-point 
reciprocal estimate single 
precision) instruction, 229 
frsp[.] (floating-point round 
to single precision) 
instruction, 97, 230, 
342-346 
frsqrte[.] (floating-point 
estimate reciprocal square 
root double precision) 
instruction, 230 
fsel[.] (floating-point select) 
instruction, 231 
fsqrt[.] (floating-point 
square root double 
precision) instruction, 231 
fsqrts[.] (floating-point 
square root single preci- 
sion) instruction, 232 
fsub[.] (floating-point 
subtract double precision) 
instruction, 93, 232 










fsubs[.] (floating-point 
subtract single precision) 
instruction, 93, 232 
functions 
bytecopy, 148 
calls, 106 
compare and swap, 329 
fetch and add, 328 
fetch and nop, 327 
fetch and store, 328 
hash, 315 
leaf, 115 
linkage conventions, 
106-119 
lock, 329 
matrix multiply, 187-188 
registers 
addresses, 106 
nonvolatile, 107-109 
stacks, 109-113 
volatile, 107-109 
stacks 
epilog, 112-115 
passing arguments, 111 
prolog, 112-115 
test and set, 328 
unlock, 329 


G 


general floating-point 
registers 
contents 
saving as 32-bit single- 
precision floating-point 
numbers, 91 
saving as 64-bit double-
precision floating-point
numbers, 91
see also floating-point 
registers 
general-integer register, 81 
general-purpose registers, 14 
GPR (General-Purpose
Registers), 6, 14,
107-108, 143





H-I 


halfwords, 59 

hardware prediction 
schemes, 73 

hash function, 315 

hashing strings, 182 

hits (memory caches), 152 


IBM POWER architec- 
ture, 143 
icbi (instruction cache block 
invalidate) instruction, 233 
IEEE (Institute of Electrical
and Electronics Engineers)
floating-point standard,
340-341
immediate addresses, 18, 39 
implementations 
architecture, 134 
processors, 133 
implicit targets (reg- 
isters), 30 
increment instructions, 144 
incrementing 
memory addresses, 145 
variable values, 328 
index variables 
as loop control vari- 
ables, 145 
sharing, 144-145 
indexed address genera- 
tion, 57 
indexed load and store 
instructions, 144-146 
indirect addresses, 18 
inslwi[.] (insert from left 
word immediate) instruc- 
tion, 50, 233 
insrdi[.] (insert from right 
doubleword immediate) 
instruction, 234 
insrwi[.] (insert from right 
word immediate) instruc- 
tion, 50, 234 
instruction set architecture, 
4-5, 358 





instructions, 358 
32-bit integer compare, 
78-79 
32-bit trap, 85 
64-bit integer compare, 80 
64-bit trap, 85 
absolute branch, 70 
accessing from memory, 22 
add[o][.], 31, 191 
addc[o][.], 191 
addcx, 149 
adde[o][.], 31, 192 
addi, 31, 192 
addic[.], 31, 192 
addis, 31, 193 
addme[o][.], 31, 193 
address program 
counters, 70 
addze[o][.], 31, 194 
AND (Boolean operation), 40
and[.], 194
andc[.], 194 
andi, 195 
andis, 195 
arithmetic, 30-39, 92-96 
b[l], 195
bc[l][a], 196
bcctr[l], 199
bclr[l], 200
Boolean operations, 40-44 
branch 
BO field encodings, 
71-72, 196-197 
branch to link reg- 
ister, 70 
coding in special purpose 
registers, 15 
combining with compare 
instructions, 78 
condition register, 
136-139 
conditional, 71 
dynamic prediction 
schemes, 141 
extended mnemonics, 
74-77 
fields, 371-372 
mnemonics, 372-373 





moving between count/ 
link registers, 80-82 
PowerPC architecture, 
135-136 
prediction schemes, 136 
special-purpose registers, 
80-82 
static prediction schemes, 
140-141 
unconditional, 71 
branch and decrement 
branches, 70, 73 
branch to count, 70 
byte reversal, 63 
clrldi[.], 200
clrlsldi[.], 200
clrlslwi[.], 50, 201
clrlwi[.], 50, 201 
clrrdi[.], 202 
clrrwi[.], 50, 202 
cmp, 202 
cmpd, 203 
cmpdi, 203 
cmpi, 204 
cmpl, 204 
cmpld, 203 
cmpldi, 204 
cmpli, 205 
cmplw, 206 
cmplwi, 206 
cmpw, 205 
cmpwi, 205 
cntlzd[.], 206
cntlzd[.], 42 
cntlzw[.], 42, 206 
comparison, 78-80 
combining with branch 
instructions, 78 
conditional registers, 137 
conditional branch, 135 
context synchronizing, 322 
control flow, 70 
convert[.], 346
count leading zero 
(integers), 42 
crand, 207 
crandc, 207 
crclr, 207 
creqv, 208 


crmove, 208 

crnand, 208 

crnor, 208 

crnot, 209 

cror, 209 

crorc, 209 

crset, 210 

crxor, 210 

dcbf, 210 

dcbi, 211 

dcbst, 212 

dcbt, 155, 212 

dcbtst, 155, 213 

dcbz, 213 

decrement, 144 

divd[o][.], 214 

divd[u][o][.], 39, 214 

divw[u][o][.], 38, 214 

divwu[o][.], 215

eciwx, 215 

ecowx, 216 

eieio, 216 

eqv[.], 216

execution time, 133 

extldi[.], 217 

extlwi[.], 49, 217 

extrdi[.], 218 

extrwi[.], 49, 218 

extsb[.], 42, 218 

extsh[.], 42, 219 

extsw[.], 42, 219 

fabs[.], 92, 220 

fadd[.], 93, 220 

fadds[.], 93, 220 

fcfid[.], 99, 221, 349 

fcmpo, 221

fcmpu, 221

fctid[.], 99, 222 

fctidz[.], 99, 222 

fctiw[.], 98, 223 

fctiwz[.], 98, 223 

fdiv[.], 94, 223 

fdivs[.], 94, 224 

floating-point, 149-150 
add/subtract, 93 
compare, 96-97 
conversion, 98-100 
convert to integer,
346-349




load, 90-91 
move, 92 
multiply accumulate, 
95-96 
multiply/divide, 94-95 
round to single pre- 
cision, 342 
rounding, 97-98 
store, 91 
fmadd[.], 95, 224 
fmadds[.], 95, 224 
fmr[.], 92, 225 
fmsub[.], 95, 225 
fmsubs[.], 95, 226 
fmul[.], 94, 226
fmuls[.], 94, 226 
fnabs[.], 92, 227 
fneg[.], 92, 227 
fnmadd[.], 95, 227 
fnmadds[.], 95, 228 
fnmsub[.], 95, 228 
fnmsubs[.], 95, 229
FPSCR, 101-102 
fres[.], 229 
frsp[.], 97, 230 
frsqrte[.], 230 
fsel[.], 231 
fsqrt[.], 231 
fsqrts[.], 232 
fsub[.], 93, 232 
fsubs[.], 93, 232 
IBM POWER architec- 
ture, 143 
icbi, 233 
immediate addresses, 39 
increment, 144 
indexed load and store, 
144-146 
inslwi[.], 50, 233 
insrdi[.], 234 
insrwi[.], 50, 234 
integers 
addition, 30-32 
division, 37-39 
loading, 59-61 
multiplication, 34-37 
overflow bits, 149 
record option, 138 










storing, 61-63 
subtraction, 32-34 

isync, 235

la, 235 

lbz[u][x], 59, 235-236

ld[u][x], 61, 237-238

ldarx, 237

lfd[u][x], 90, 239-240

lfs[u][x], 90, 240-241

lha[u][x], 59, 242

lhbrx, 63, 243 

lhz[u][x], 59, 244-245

li, 245 

lis, 246 

lmw, 64, 246

load and store (memory), 
57-67 

logical (condition register), 
82-83, 137 

logical operation, 39-44 

lswi, 66, 246

lswx, 66, 247

lwa[[u]x], 60, 247-248

lwarx, 247 

lwbrx, 63, 248 

lwz[u][x], 59, 249-250

mcrf, 83, 250

mcrfs, 101, 251

mcrxr, 81, 251

memory, 142 

memory cache optimiza- 
tion, 154 

mfcr, 251 

mfctr, 80, 252 

mffs[.], 101, 252 

mflr, 80, 252 

mfmsr, 252 

mfspr, 56, 253 

mfsr, 256, 314 

mfsrin, 256, 314 

mftb, 256-257 

mftbu, 257 

mfxer, 56, 258 

move assist (integers), 
66-67 

move to/from branch 
special-purpose reg- 
isters, 82 


move to/from special- 
purpose registers, 56-57 

mr[.], 42, 258 

mtcr, 258 

mtcrf, 81, 259

mtctr, 80, 259

mtfsb0[.], 101, 259

mtfsb1[.], 101, 260

mtfsf[.], 100, 260

mtfsfi[.], 100, 260

mtlr, 80, 261

mtmsr, 261 

mtspr, 56, 261 

mtsr, 262, 314 

mtsrin, 262, 314 

mtxer, 56, 262

mulhd[u][.], 36, 262-263

mulhw[u][.], 35, 263-264

mulld[o][.], 36, 264

mullhw[u][.], 36

mulli, 35-36, 264 

mullw[o][.], 35-36, 265 

NAND (Boolean opera- 
tion), 41, 265 

neg[o][.], 265 

nop, 42, 266 

NOR (Boolean operation), 
41, 266 

NOT (Boolean operation), 
41, 266 

OR (Boolean operation), 
40, 267 

orc[.], 267 

ori, 267 

oris, 268 

out-of-order execution, 5 

pipelining, 134-135 

POWER instruction set 
architecture, 6 

prediction schemes 
(hardware), 73 

privileged MSR, 305 

register addresses, 39 

relative branch, 70 

reservations, 326-327 

rfi, 268 

RISC, 141-142 

rldcl[.], 54, 268 

rldcr[.], 54, 269


rldic[.], 53, 269

rldicl[.], 53, 269

rldicr[.], 53, 270

rldimi[.], 54, 270 

rlwimi[.], 50, 271 

rlwinm[.], 49, 271

rlwnm[.], 271 

rotate (registers), 49-56 

rotld[.], 272 

rotldi[.], 272 

rotlw, 50 

rotlw[.], 272 

rotlwi[.], 50, 273 

rotrdi[.], 273 

rotrwi[.], 50

rotrwi[.], 274

sc, 274 

shift (registers), 44-48 

sign extend operation 
(integers), 42 

signed shift (registers), 
46-48 

slbia, 274 

slbie, 274 

sld[.], 46, 275 

sldi[.], 275 

slw[.], 45, 276 

slwi[.], 46, 276

special load/store (inte- 
gers), 63 

srad[.], 48, 276 

sradi[.], 48, 277 

sraw[.], 46, 277 

srawi[.], 46, 278 

srd[.], 46, 278 

srdi[.], 278 

srw[.], 45, 279 

srwi[.], 46, 279

status and control register,
100-102

stb[u][x], 62, 280

stBT, 281 

std[u][x], 63, 281-282

stdcx., 281 

stfd[u][x], 91, 283-284

stfiwx, 285 

stfs, 285 

stfs[u][x], 91, 285-286

sth[u][x], 62, 286-288





sthbrx, 64, 287
stmw, 64, 288
stswi, 66, 289
stswx, 66, 289
stw[u][x], 62, 289-290
stwbrx, 64, 290
sub[o][.], 33, 292
subc[o][.], 33, 292
subf[o][.], 33, 292
subfc[o][.], 33, 293
subfe[o][.], 33, 293
subfic, 33, 294
subfme[o][.], 33, 294
subfze[o][.], 33, 294
subi, 32, 295
subic[.], 33, 295
subis, 32, 295
sync, 296
system call, 84-87
td, 296
tdi, 296
tlbia, 298
tlbie, 299
tlbsync, 299
trap, 23, 298
trap call, 84-87
tw, 299
twi, 300
unconditional branch, 70
unsigned shift (registers),
45-46
XNOR (Boolean operation), 41
XOR (Boolean operation),
41, 300
xori, 301
xoris, 301
stwcx., 291
integer processor architecture, 19
integers
64-bit arithmetic symbols,
171-181
add instructions, 30-32
arrays, storing in
memory, 73
Boolean operation
instructions, 40-44
byte reversal instructions, 63
converting to/from
floating-point numbers, 98
count leading zero
instructions, 42
divide instructions, 37-39
instructions, 30
load, 59-61
logical operation, 39-44
lmw, 246
move assist, 66-67
mr[.], 42
multiply, 34-37
neg[o][.], 265
overflow bits, 149
sign extend operation, 42
special load/store, 63
store, 61-63
subtract, 32-34
processor data sizes, 59
two’s complement
representation, 19
Intel 80x86 processor
portability, 352
interrupts
addresses, 23
architecture, 23, 330-331
programs, 23
system call, 23
vector locations, 331
invalidating cache blocks,
211, 233
isync (instruction synchronize)
instruction, 235
iteration variables (count
registers), 73


J-K-L


jumps (switch variables), 81
la (load address) instruction, 235
latency (memory), 12
lbz (load byte and zero-
extend immediate)
instruction, 235
lbz[u][x] (load byte and zero
extend) instruction, 59
lbzu (load byte and zero-
extend immediate with
update) instruction, 236
lbzux (load byte and zero-
extend with update)
instruction, 236
lbzx (load byte and zero-
extend) instruction, 236
ld (load doubleword
immediate) instruction, 237
ld[u][x] (load doubleword)
instruction, 61
ldarx (load doubleword and
reserve) instruction, 237
ldu (load doubleword
immediate with update)
instruction, 238
ldux (load doubleword with
update) instruction, 238
ldx (load doubleword)
instruction, 238
leaf functions, 115
lfd (load double precision
floating-point immediate)
instruction, 239
lfd[u][x] (load floating-point
double) instructions, 90
lfdu (load double precision
floating-point immediate
with update) instruction, 239
lfdux (load double precision
floating-point with update)
instruction, 240
lfdx (load double precision
floating-point) instruction, 240
lfs (load single precision
floating-point immediate)
instruction, 240
lfs[u][x] (load floating-point
single) instructions, 90










lfsu (load single precision
floating-point immediate
with update) instruction, 241

lfsux (load single precision
floating-point with update)
instruction, 241

lfsx (load single precision
floating-point) instruction, 241

lha (load halfword algebraic 
immediate) instruc- 
tion, 242 

lha[u] [x] (load halfword 
algebraic) instructions, 59 

lhau (load halfword 
algebraic immediate with 
update) instruction, 242 

lhaux (load halfword 
algebraic with update) 
instruction, 242 

lhax (load halfword alge- 
braic) instruction, 243 

lhbrx (load halfword and 
reverse bytes) instruction, 
63, 243 

lhz (load half-word and 
zero-extend immediate) 
instruction, 244 

lhz[u] [x] (load halfword and 
zero) instruction, 59 

lhzu (load half-word and 
zero-extend immediate 
with update) instruc- 
tion, 244 

lhzux (load half-word and 
zero-extend with update) 
instruction, 244 

lhzx (load half-word and 
zero-extend) instruc- 
tion, 245 

li (load immediate) instruc- 
tion, 245 

library routines (programs), 
optimizing, 131 

LIFO (Last In First Out) 
stacks, 109-113 

lines (memory caches), 152 


link area (stack frames), 
110-111 
link registers, 15, 23, 70, 
80-82 
linkage conventions, 
106-119 
linking subroutines, 70 
lis (load immediate shifted) 
instruction, 246 
little endian, see endian 
lmw (integer load multiple
word) instruction, 64, 246
load and store instructions 
memory, 57-67 
special integer opera- 
tions, 63 
load floating-point double 
(lfd[u][x]) instruction, 90 
load floating-point single 
(lfs[u][x]) instruction, 90 
load instructions (integers), 
59-61 
load/store instruction set 
architecture, 11 
loading 
32-bit single-precision 
floating-point numbers 
into general floating- 
point registers, 90 
64-bit double-precision 
floating-point numbers 
into general floating- 
point registers, 90 
GPRs with POWER 
instructions, 143 
local stack frames, 111-112 
lock function, 329 
logical instructions 
conditional registers, 
82-83, 137 
NOR (nor[.]), 266 
NOT (not[.]), 266 
operations, 39-44 
OR (or[.]), 267 
OR immediate (ori), 267 
OR shifted immediate 
(oris), 268 


OR with complement
(orc[.]), 267
XOR (xor[.]), 300 
XOR immediate 
(xori), 301 
XOR shifted immediate 
(xoris), 301 
loops 
closing, 73 
control variables, 145 
count registers 
controlling with, 146 
iterations, 138-139 
PowerPC architecture, 
135-136 
lswi (load string word
immediate) instruction,
66, 246
lswx (load string word
indexed) instruction,
66, 247
lwa (load word algebraic 
immediate) instruc- 
tion, 247 
lwa[[u]x] (load word 
algebraic) instruction, 60 
lwarx (load word and 
reserve) instruction, 247 
lwaux (load word algebraic 
with update) instruc- 
tion, 248 
lwax (load word algebraic) 
instruction, 248 
lwbrx (load word and 
reverse bytes) instruction, 
63, 248 
lwz (load word and zero- 
extend immediate) 
instruction, 249 
lwz[u] [x] (load word and 
zero) instruction, 59 
lwzu (load word and zero- 
extend immediate with 
update) instruction, 249 
lwzux (load word and zero- 
extend with update) 
instruction, 250 
lwzx (load word and zero- 
extend) instruction, 250 





machine state register 
(MSR), 304-306, 338 
mantissas (floating-point 
numbers), 20 
matrix multiply functions, 
187-188 
mcrf (move condition
register field) instruction,
83, 250
mcrfs (move FPSCR to 
condition register) 
instruction, 101, 251 
mcrxr (move XER to 
condition register) 
instruction, 81, 251 
measuring program perfor- 
mance with development 
tools, 131-133 
memory
32-bit page translation 
mechanism, 310-313 
addresses, 17-18 
formatting via instruc- 
tions, 142 
incrementing, 145 
BAT mechanism, 313-314 
boundaries, 154 
bytes (endian storage), 
58-59 
caches 
alignment, 154 
optimizing, 154 
processors, 150-155 
properties, 312 
TLBs, 319 
DRAM, 12 
instructions, 142 
latency, 12 
load and store instructions, 
57-67 
page tables, 315-318 
pages, 311-313 
reference alignment, 143 
registers, 13-15 
virtual, 15-17, 309-310 





32-bit management, 
314-319 
64-bit architecture, 
319-320 
memory reference ([x]), 24 
mfcr (move from condition 
register) instruction, 251 
mfctr (move from count 
register) instruction, 
80, 252 
mffs[.] (move from FPSCR) 
instruction, 101, 252 
mflr (move from link 
register) instruction, 
80, 252 
mfmsr (move from machine 
state register) instruc- 
tion, 252 
mfspr (move from special- 
purpose register) instruc- 
tion, 56, 253 
mfsr (move from segment 
register) instruction, 
256, 314 
mfsrin (move from segment 
register indirect) instruc- 
tion, 256, 314 
mftb (move from time base 
register) instruction, 
256-257 
mftbu (move from time base 
register upper) instruc- 
tion, 257 
mfxer (move from XER) 
instruction, 56, 258 
microprocessors, 11 
minus sign (-), extended 
mnemonics, 77 
misses (memory caches), 
152, 153 
mnemonics 
branch instructions, 
372-373 
extended 
64-bit rotation instruc- 
tions, 55-56 
BO field encodings, 
74-75, 197-198 




condition register bit 
encodings, 76-77, 
198-199 
moving between count/ 
link registers, 80 
TO field encodings, 
86-87, 297 
trap instructions, 
87, 298 
Motorola 680x0 processor 
portability, 352 
move assist instructions 
(integers), 66-67 
mr[.] (integer register move) 
instruction, 42, 258 
MSR (machine state 
register), 304-306, 338 
mtcr (move to condition 
register) instruction, 258 
mtcrf (move to condition 
register field) instruction, 
81, 259 
mtctr (move to count 
register) instruction, 
80, 259 
mtfsb0[.] (reset FPSCR bit)
instruction, 101, 259
mtfsb1[.] (set FPSCR bit)
instruction, 101, 260
mtfsf[.] (move to FPSCR
fields) instruction,
100, 260
mtfsfi[.] (move to FPSCR 
field immediate) instruc- 
tion, 100, 260 
mtlr (move to link register) 
instruction, 80, 261 
mtmsr (move to machine 
state register) instruc- 
tion, 261 
mtspr (move to special 
purpose register) instruc- 
tion, 56, 261 
mtsr (move to segment 
register) instruction, 
262, 314 
mtsrin (move to segment 
register indirect) instruc- 
tion, 262, 314 










mtxer (move to XER) 
instruction, 56, 262 

mulhd[.] (multiply integer 
doubleword, return high 
doubleword) instruc- 
tion, 262 

mulhd[u][.] (multiply high 
doubleword) instruc- 
tion, 36 

mulhdu[.] (multiply integer 
doubleword unsigned, 
return high doubleword) 
instruction, 263 

mulhw[.] (multiply integer 
word, return high word) 
instruction, 263 

mulhw[u][.] (multiply high 
word) instruction, 35 

mulhwu[.] (multiply integer 
word unsigned, return 
high word) instruc- 
tion, 264 

mulld[o][.] (multiply integer 
doubleword, return low 
doubleword) instruc- 
tion, 36, 264 

mullhw[u][.] (multiply high
word) instruction, 36

mulli (multiply integer 
immediate, return low 
word) instruction, 
35-36, 264 

mullw[o][.] (multiply 
integer word, return low 
word) instruction, 
35-36, 265 

multiply instructions 

multiplying floating-point 
register contents, 94-95 

multiprogramming systems, 
325-329 


N 


NaN (not a number), 
floating-point numbers, 21 

NAND (Boolean operation) 
instruction, 41, 265 





neg[o][.] (integer negate) 
instruction, 265 

nonvolatile registers, 
107-109 

nop (no operation) instruc- 
tion, 42, 266 

NOR (Boolean operation) 
instruction, 41, 266 

NOT (Boolean operation) 
instruction, 41, 266 

notations, 24-25, 190-191 

null bytes (C strings), 147 


O


Oo, instruction mnemonics, 
30
on-chip cache memory, 150 
operands (condition 
registers), 75-78 
operating systems, 7-8 
development tools, 
130-132 
measuring program 
performance, 132 
passing program control 
to, 84-87 
operators (symbols), 24-25 
optimizing 
algorithms, 131 
library routine pro- 
grams, 131 
memory caches, 151, 154 
processors, 133-149 
programming, 130 
programs, 133-149 
system routines, 131 
text references (re- 
sources), 382 
OR (Boolean operation) 
instruction, 40, 267 
orc[.] (logical OR with
complement) instruction, 267
ordering conventions 
(endian), 354-355 
ori (logical OR immediate)
instruction, 267

oris (logical OR shifted 
immediate) instruc- 
tion, 268 

out-of-order execution 
(instructions), 5 

overflow bits (registers), 
148-149 


P 


page tables (memory), 
310-318 
parallel pipelining, 134 
passing 
arguments (stack 
frames), 111 
program control to 
operating systems, 84-87 
PCBs (printed circuit 
boards), 12 
performance 
architecture, 10-11 
programs, 131-133 
text references (re- 
sources), 382 
pipelined superscalar 
machines, 133 
pipelining instructions, 
134-135 
plus sign (+) 
branch instruction 
mnemonics, 141 
extended mnemonics, 77 
popping stacks (functions), 
112-115 
portability (processors), 
352-353 
POWER architecture
(IBM), 143
POWER instruction set 
architecture, 4-6 
PowerOpen ABI, 8, 106 
PowerPC 
architecture, 13-23 
branch instructions, 
135-136 
loops, 135-136 











implementation 
devices, 353 
PowerPC Reference 
Platform (PReP), 7 
prediction schemes (branch 
instructions), 136 
dynamic, 141 
hardware, 73 
static, 140-141 
PReP (PowerPC Reference 
Platform), 7 
printed circuit boards 
(PCBs), 12 
privileged instructions, 305 
processors 
601, 5 
603, 5 
620, 6 
architecture, 6-7 
floating-point, 90, 149- 
150 
implementations, 133 
instructions 
execution time, 133 
out-of-order execution, 5 
pipelining, 134, 135 
integer data sizes, 59 
Intel 80x86 port- 
ability, 352 
memory 
boundaries, 154 
caches, 150-155 
instructions, 142 
Motorola 680x0 portabil- 
ity, 352 
multiple synchronizations, 
325-329 
optimizing, 133-149 
PCBs, 12 
portability, 352-353 
registers, 13-15 
runtime environment 
symbols, 170-171 
profilers, 130-131 
program counters (instruc- 
tion addresses), 70 
programming, 130 


programming languages 
assembly, 376 
C (strings), 147 
overflow/carry bits, 
148-149 
programs 
code optimization, 131, 
144-148 
control flow, 14 
passing to operating 
systems, 84-87 
sequential, altering, 135 
data flow, 14 
data shifts, 78 
instructions 
branch, 70-77 
compare, 78-80 
control flow, 70 
interrupts, 23 
library routines, optimiz- 
ing, 131 
loops 
closing, 73 
control variables, 145 
controlling with count 
registers, 146 
optimizing, 133-149 
performance, 131-132 
strings 
Boyer-Moore searches, 
183-187 
hashing, 182 
symbols, 160-163 
system routines, 131 
timing symbols, 168-170 
prolog functions (stacks), 
112-115 
properties (memory pages), 
311-313 
pseudo operations (assembly 
PTE (page table entry), 
memory, 316 
modification operations, 
323-324 
synchronization require- 
ments, 322 
pushing stacks (functions), 
112-115 




Q-R


reading FPSCR, 101 
recoding assembly language 
routines, 131 
record option (integer 
instructions), 138 
reduced instruction set 
complexity, see RISC 
references (memory), 143 
register set architecture, 
374-375 
registers, 13-15 
addresses, 17 
byte ordering con- 
ventions, 354 
condition, 14 
bit codes, 76, 82 
branch instructions, 
136-139 
comparison instruc- 
tions, 137 
copying XER high-order 
bits, 81 
data entry, 78-80 
extended mnemonics (bit 
codes), 76-77 
field encodings, 78 
field operand, 75 
fields, 96, 101 
logical instructions, 
82-83, 137 
updating, 73 
contents, addressing 
memory, 18 
count, 15, 23 
controlling loops, 146 
iteration variables, 73 
loop iterations, 138-139 
moving values to/from 
link register, 80-82 
floating-point, 14, 22, 90 
arithmetic operations 
(addition/subtraction), 
93-95 
contents, moving, 92 
FPSCR, 15, 100-102, 
337-339 
















function addresses, 106 
general floating-point, 91 
general integer, 81 
GPRs, 6, 14 
implicit targets, 30 
link, 15, 23, 70, 80-82 
MSR, 304-306, 338 
nonvolatile, 107-109 
overflow/carry bits, 
148-149 
rotate instructions, 49-56 
segments, 310, 314 
synchronization 
requirements, 322 
updating, 323 
shift instructions, 44-48 
SPR, 15, 56-57, 253-255, 
306-309 
stacks, 109-113 
status, 149 
status and control, 14 
TBR, 257 
usage conventions, 
107-109 
volatile, 107-109 
XER, 14, 149 
relative branch instruc- 
tions, 70 
replacing 
memory caches, 152 
variable values, 328 
reservations (instructions), 
326-327 
reset FPSCR bit instruction 
return branch instruc- 
tions, 70 
returning variable 
values, 327 
rfi (return from interrupt) 
instruction, 268 
RISC (Reduced Instruction 
Set Complexity), 4, 
141-142 
rotate instructions (regis- 
ters), 49-56, 268-274 
rounding floating-point 
numbers, 21-22, 97-98 


routines 
assembly language, 131 
library, 131 
string copy, 144 
system, 131 
runtime environments 
(processors), 170-171 
Rx (register reference), 24 


S


saving general floating-point 
register contents, 91 
sc (system call) instruction, 
84, 274 
segment registers, 310, 314 
direct store segments, 320 
synchronization require- 
ments, 322 
updating, 323 
segment table entries, see 
STE 
segments, 320
sequential addresses (branch 
instructions), 70 
sequential flow control 
(programs), 135 
set FPSCR bit (mtfsb1[.]) 
instruction, 101 
set-associativity (memory 
caches), 152 
sharing index variables, 
144-145 
shift instructions (registers), 





shifts (program data), 78

sign extend byte (extsb[.]) 
instruction, 42 

sign extend halfword 
(extsh[.]) instruction, 
42, 219 

sign extend operation 
instructions (integers), 42 

sign extend word (extsw[.]) 
instruction, 42, 219 

sign-extend byte (extsb[.])
instruction, 218

signed bytes, 59 


signed comparisons (condi- 
tion register operands), 78 

signed shift instructions 
(registers), 46-48 

simple subtract (sub[o][.])
instruction, 33 

slbia (SLB invalidate all) 
instruction, 274 

slbie (SLB invalidate entry) 
instruction, 274 

sld[.] (shift left doubleword) 
instruction, 46, 275 

sldi[.] (shift left doubleword 
immediate) instruc- 
tion, 275 

slw[.] (shift left word) 
instruction, 45, 276 

slwi[.] (shift left word 
immediate) instruction, 
46, 276 

special load/store instruc- 
tions (integers), 63 

special purpose registers, 15, 
56-57, 306-309 

split caches (memory), 152 

SPR (Special-Purpose 
Registers), 109, 253-255, 
306-309 

srad[.] (shift right algebraic 
doubleword) instruction, 
48, 276 

sradi[.] (shift right algebraic 
doubleword immediate) 
instruction, 48, 277 

sraw[.] (shift right algebraic
word) instruction, 46, 277 

srawi[.] (shift right algebraic
word immediate) instruc- 
tion, 46, 278 

srd[.] (shift right 
doubleword) instruction, 
46, 278 

srdi[.] (shift right 
doubleword immediate) 
instruction, 278 

srw[.] (shift right word)
instruction, 45, 279 





srwi[.] (shift right word 
immediate) instruction, 
46, 279 
stacks 
frames, 110-112 
aligning, 110 
creating, 114-115 
destroying, 114-115 
link area, 110-111 
local, 111-112 
passing arguments, 111 
register save areas, 112 
functions 
epilog, 112-115 
leaf, 115 
prolog, 112-115 
registers, 109-113 
static prediction schemes 
(branch instructions), 
140-141 
status and control reg- 
isters, 14 
status registers, 149 
stb (store byte immediate) 
instruction, 280 
stb[u][x] (store byte)
instruction, 62 
stbx (store byte) instruction, 281
stbu (store byte immediate 
with update) instruc- 
tion, 280 
stbux (store byte with 
update) instruction, 280 
std (store doubleword 
immediate) instruc- 
tion, 281 
std[u][x] (store doubleword)
instruction, 63, 282 
stdcx. (conditional store 
doubleword) instruc- 
tion, 281 
stdu (store doubleword 
immediate with update) 
instruction, 282 
stdx (store doubleword) 
instruction, 283 
STE (segment table 
entries), 324 


stfd (store double precision 
floating-point immediate) 
instruction, 283 

stfd[u][x] (store floating-
point double) instruction, 
91, 284 

stfdu (store double precision 
floating-point immediate 
with update) instruc- 
tion, 284 

stfdx (store double precision 
floating-point) instruc- 
tion, 284 

stfiwx (store floating-point 
as integer word) instruc- 
tion, 285 

stfs (store single-precision 
floating-point immediate) 
instruction, 285 

stfs[u][x] (store floating-
point single) instruction, 
91, 286 

stfsu (store single-precision 
floating-point immediate 
with update) instruc- 
tion, 285 

stfsx (store single-precision 
floating-point) instruc- 
tion, 286 

sth (store integer halfword) 
instruction, 286 

sth[u][x] (store halfword)
instruction, 62, 288 

sthbrx (store halfword byte- 
reverse indexed) instruc- 
tion, 64 

sthbrx (store integer 
halfword with bytes
reverse) instruction, 287 

sthu (store half-word 
immediate with update) 
instruction, 287 

sthx (store half-word) 
instruction, 288 

stmw (store multiple integer 
word) instruction, 64, 288 

storage classes (assembly 
language), 378 




store floating-point double 
(stfd[u][x]) instruction, 91
store floating-point single 
(stfs[u][x]) instruction, 91 
store instructions (integers), 
61-63 
storing 
cache blocks into 
memory, 212 
GPRs with POWER 
instructions, 143 
integer arrays in 
memory, 73 
string copy routines 
(assembly language), 144
strings 
Boyer-Moore searches, 
183-187 
C programming lan- 
guage, 147 
hashing, 182 
stswi (store string word 
immediate) instruction, 
66, 289 
stswx (store string word 
indexed) instruction, 
66, 289 
stw (store integer word) 
instruction, 289 
stw[u][x] (store word)
instruction, 62, 290 
stwbrx (store integer word 
with bytes reversed) 
instruction, 64, 290 
stwcx. (conditional store 
word) instruction, 291 
stwu (store word immediate 
with update) instruc- 
tion, 290 
stwx (store word) instruc- 
tion, 291 
subroutine linkage conven- 
tions, 106-119 
subroutines 
addresses, 23 
calling, 70 
linking, 70 


unconditional calls, 71 















subtract instructions 
(integers), 32-34, 292-295 
subtracting floating-point 
register contents, 93-94 
supervisor processing mode 
(PowerPC architec- 
ture), 19 
switch variables (jumps), 81 
symbols, 24-25, 190-191 
arithmetic, 171-181 
assembly language, 378 
C programming language, 
163-168 
program timing, 168-170 
programs, 160-163 
runtime environments 
(processors), 170-171 
sync (synchronize) instruc- 
tion, 296 
synchronizing multiple 
processors, 325-329 
system calls 
instructions, 84-87 
interrupts, 23 
system routines (programs), 
optimizing, 131 


T 


target addresses (branch 
instructions), 70 
TBR (Time Base Reg- 
ister), 257 
td (trap doubleword) 
instruction, 85, 296 
tdi (trap doubleword 
immediate) instruction, 
85, 296 
technological develop- 
ment, 4 
test and set function, 328 
text references (resources) 
architecture, 380 
optimizing, 382 
performance, 382 
three-stage pipelining, 134 
time command (UNIX), 133 





timing programs, 168-170

tlbia (TLB invalidate all) 
instruction, 298 

tlbie (TLB invalidate entry) 
instruction, 299 

TLBs (translation lookaside 
buffers), 319 

tlbsync (TLB synchronize) 
instruction, 299 

TO field encodings (ex- 
tended mnemonics), 
86-87, 297 

trap call instructions, 84-87 

trap instructions, 23, 298 

tw (trap word) instruction, 
84, 299 

twi (trap word immediate) 
instruction, 84, 300 

two’s complement represen- 
tation (integers), 19 


U 


unconditional branch 
instructions, 70, 71 
unconditional long branch 
instruction (b[l][a]), 71
unconditional subroutine 
calls, 71 
UNIX time command, 133 
unlock function, 329 
unsigned shift instructions 
(registers), 45-46 
update forms (load and store 
instructions), 57 
updating 
condition registers, 73 
indexed load and store 
instructions, 145-146 
segment registers, 323 
usage conventions (regis- 
ters), 107-109 
user processing mode 
(PowerPC architec- 
ture), 19 





V 


variables 
index, sharing, 144-145 
iteration (count reg- 
isters), 73 
switch, 81 
values 
incrementing, 328 
replacing, 328 
returning, 327 
vector locations (inter- 
rupts), 331 
virtual memory, 15-17, 
309-310 
32-bit management, 
314-319 
64-bit architecture, 
319-320 
volatile registers, 107-109 


W 


word (data sizes), 59 
writeback (memory 
caches), 152 


X-Y-Z 


[x] (memory reference), 24 
XER (exception register), 
14, 149 
high-order bits, copying to 
condition register, 81 
move to/from instruc- 
tions, 56 
XNOR (Boolean operation) 
instruction, 41 
XOR (Boolean operation) 
instruction, 41, 300 
xori (logical XOR immedi- 
ate) instruction, 301 
xoris (logical XOR shifted 
immediate) instruc- 
tion, 301 
continued from front cover


lswi      load string word immediate
lswx      load string word indexed
lwa       load word algebraic immediate
lwarx     load word and reserve
lwaux     load word algebraic with update
lwax      load word algebraic
lwbrx     load word and reverse bytes (indexed addressing form)
lwz       load word and zero-extend immediate
lwzu      load word and zero-extend immediate with update
lwzux     load word and zero-extend with update
lwzx      load word and zero-extend
mcrf      move condition register field
mcrfs     move FPSCR to condition register
mcrxr     move XER to condition register
mfcr      move from condition register
mfctr     move from count register (CTR)
mffs      move from FPSCR
mflr      move from link register (LR)
mfmsr     move from machine state register
mfspr     move from special purpose register
mfsr      move from segment register
mfsrin    move from segment register indirect
mftb      move from time base register
mftb      move from time base register
mftbu     move from time base register upper
mfxer     move from integer exception register (XER)
mr        move register
mtcr      move to condition register
mtcrf     move to condition register fields
mtctr     move to count register (CTR)
mtfsb0    move to FPSCR bit 0 (reset FPSCR bit)
mtfsb1    move to FPSCR bit 1 (set FPSCR bit)
mtfsf     move to FPSCR fields
mtfsfi    move to FPSCR field immediate
mtlr      move to link register (LR)
mtmsr     move to machine state register
mtspr     move to special purpose register
mtsr      move to segment register
mtsrin    move to segment register indirect
mtxer     move to integer exception register (XER)
mulhd     multiply integer doubleword, return high doubleword
mulhdu    multiply integer doubleword unsigned, return high doubleword
mulhw     multiply integer word, return high word
mulhwu    multiply integer word unsigned, return high word
mulld     multiply integer doubleword, return low doubleword
mulli     multiply integer immediate, return low word
mullw     multiply integer word, return low word
nand      logical NAND
neg       integer negate
nop       no operation
nor       logical NOR
not       logical NOT
or        logical OR
orc       logical OR with complement
ori       logical OR immediate
oris      logical OR shifted immediate
rfi       return from interrupt
rldcl     rotate left doubleword then clear left
rldcr     rotate left doubleword then clear right
rldic     rotate left doubleword immediate then clear
rldicl    rotate left doubleword immediate then clear left
rldicr    rotate left doubleword immediate then clear right
rldimi    rotate left doubleword immediate then insert mask
rlwimi    rotate left word immediate then insert mask
rlwinm    rotate left word immediate then AND with mask
rlwnm     rotate left word then AND with mask
rotld     rotate left doubleword
rotldi    rotate left doubleword immediate
rotlw     rotate left word
rotlwi    rotate left word immediate


rotrdi    rotate right doubleword immediate
rotrwi    rotate right word immediate
sc        system call
slbia     SLB invalidate all
slbie     SLB invalidate entry
sld       shift left doubleword
sldi      shift left doubleword immediate
slw       shift left word
slwi      shift left word immediate
srad      shift right algebraic doubleword
sradi     shift right algebraic doubleword immediate
sraw      shift right algebraic word
srawi     shift right algebraic word immediate
srd       shift right doubleword
srdi      shift right doubleword immediate
srw       shift right word
srwi      shift right word immediate
stb       store byte immediate
stbu      store byte immediate with update
stbux     store byte with update
stbx      store byte
std       store doubleword immediate
stdcx.    conditional store doubleword
stdu      store doubleword immediate with update
stdux     store doubleword with update
stdx      store doubleword
stfd      store double-precision floating-point immediate
stfdu     store double-precision floating-point immediate with update
stfdux    store double-precision floating-point with update
stfdx     store double-precision floating-point
stfiwx    store floating-point as integer word
stfs      store single-precision floating-point immediate
stfsu     store single-precision floating-point immediate with update
stfsux    store single-precision floating-point with update
stfsx     store single-precision floating-point
sth       store integer halfword
sthbrx    store integer halfword with bytes reversed
sthu      store half-word immediate with update
sthux     store half-word with update
sthx      store half-word
stmw      store multiple integer word
stswi     store string immediate
stswx     store string
stw       store integer word
stwbrx    store integer word with bytes reversed
stwu      store word immediate with update
stwux     store word with update
stwx      store word
stwcx.    conditional store word
sub       subtract
subc      subtract carrying
subf      integer subtract from
subfc     integer subtract from carrying
subfe     integer subtract from extended
subfic    integer subtract from (immediate addressing, carrying)
subfme    subtract from minus one (extended)
subfze    subtract from (zero extended)
subi      subtract immediate
subis     subtract immediate shifted
subic     subtract immediate carrying
sync      synchronize
td        trap doubleword
tdi       trap doubleword immediate
tlbia     TLB invalidate all
tlbie     TLB invalidate entry
tlbsync   TLB synchronize
tw        trap word
twi       trap word immediate
xor       logical XOR
xori      logical XOR immediate
xoris     logical XOR shifted immediate